Consider the following xml fragment:
<Obj>
<Name><![CDATA[SomeText]]></Name>
</Obj>
How do I retrieve the "SomeText" value via XPath? I'm using Nauman Leghari's (excellent) Visual XPath tool.
/Obj/Name returns the element
/Obj/Name/text() returns blank
I don't think it's a problem with the tool (I may be wrong). I also read that XPath can't extract CDATA (see the last response in this thread), which sounds kind of weird to me.
/Obj/Name/text() is the XPath to return the content of the CDATA markup.
What threw me off was the behavior of the Value property. For an XmlNode (DOM world), the XmlNode.Value property of an Element (with CDATA or otherwise) returns null. The InnerText property gives you the CDATA/text content.
If you use Xml.Linq, XElement.Value returns the CDATA content.
string sXml = @"
<object>
<name><![CDATA[SomeText]]></name>
<name>OtherName</name>
</object>";
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml( sXml );
XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
Console.WriteLine(@"XPath = /object/name" );
WriteNodesToConsole(xmlDoc.SelectNodes("/object/name", nsMgr));
Console.WriteLine(@"XPath = /object/name/text()" );
WriteNodesToConsole( xmlDoc.SelectNodes("/object/name/text()", nsMgr) );
Console.WriteLine(@"Xml.Linq = obRoot.Elements(""name"")");
XElement obRoot = XElement.Parse( sXml );
WriteNodesToConsole( obRoot.Elements("name") );
Output:
XPath = /object/name
NodeType = Element
Value = <null>
OuterXml = <name><![CDATA[SomeText]]></name>
InnerXml = <![CDATA[SomeText]]>
InnerText = SomeText
NodeType = Element
Value = <null>
OuterXml = <name>OtherName</name>
InnerXml = OtherName
InnerText = OtherName
XPath = /object/name/text()
NodeType = CDATA
Value = SomeText
OuterXml = <![CDATA[SomeText]]>
InnerXml =
InnerText = SomeText
NodeType = Text
Value = OtherName
OuterXml = OtherName
InnerXml =
InnerText = OtherName
Xml.Linq = obRoot.Elements("name")
Value = SomeText
Value = OtherName
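The same distinction between an element's own value and the value of its CDATA child can be seen outside .NET. The following is a minimal Python sketch using the standard library's xml.dom.minidom; the helper name describe_name is made up for illustration and is not part of any tool discussed here.

```python
from xml.dom import minidom, Node

XML = "<Obj><Name><![CDATA[SomeText]]></Name></Obj>"

def describe_name(xml_text):
    """Report the DOM values of the Name element and of its first child node."""
    doc = minidom.parseString(xml_text)
    name = doc.getElementsByTagName("Name")[0]
    child = name.firstChild
    return {
        # An element node has no value of its own in the DOM.
        "element_value": name.nodeValue,
        # The CDATA section is kept as a separate child node.
        "child_is_cdata": child.nodeType == Node.CDATA_SECTION_NODE,
        # The character data lives on that child.
        "child_value": child.nodeValue,
    }

if __name__ == "__main__":
    print(describe_name(XML))
```

As with XmlNode.Value above, the element itself reports no value, while the CDATA child carries the text.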
Turned out the author of Visual XPath had a TODO for the CDATA type of XmlNodes. A little code snippet and I have CDATA support now.
MainForm.cs
private void Xml2Tree( TreeNode tNode, XmlNode xNode)
{
...
case XmlNodeType.CDATA:
//MessageBox.Show("TODO: XmlNodeType.CDATA");
// Gishu
TreeNode cdataNode = new TreeNode("![CDATA[" + xNode.Value + "]]");
cdataNode.ForeColor = Color.Blue;
cdataNode.NodeFont = new Font("Tahoma", 12);
tNode.Nodes.Add(cdataNode);
//Gishu
break;
CDATA sections are just part of what in XPath is known as a text node or in the XML Infoset as "chunks of character information items".
Obviously, your tool is wrong. Other tools, such as the XPath Visualizer, correctly highlight the text of the Name element when evaluating this XPath expression:
/*/Name/text()
One can also write a simple XSLT transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
"<xsl:value-of select="/*/Name"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<Obj>
<Name><![CDATA[SomeText]]></Name>
</Obj>
the correct result is produced:
"SomeText"
I think the thread you referenced says that the CDATA markup itself is ignored by XPath, not the text contained in the CDATA markup.
My guess is that it's an issue with the tool. The source code is available for download, so maybe you can debug it.
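The point that only the CDATA markup is dropped, not the text inside it, can be checked with a small Python sketch. It assumes the standard library's xml.etree.ElementTree, which supports only a limited XPath subset; the function name name_text is hypothetical.

```python
import xml.etree.ElementTree as ET

WITH_CDATA = "<Obj><Name><![CDATA[SomeText]]></Name></Obj>"
PLAIN_TEXT = "<Obj><Name>SomeText</Name></Obj>"

def name_text(xml_text):
    """Return the text of the Name child of the root element."""
    root = ET.fromstring(xml_text)
    # "Name" is a relative path from the root element <Obj>.
    return root.findtext("Name")

if __name__ == "__main__":
    print(name_text(WITH_CDATA))
    print(name_text(PLAIN_TEXT))
```

Both documents yield the same string, because the parser hands the CDATA content over as ordinary character data.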
See if this helps - http://www.zrinity.com/xml/xpath/
XPATH = /Obj/Name/text()
Just in case you run into a similar issue with jdom2, text() will be an array.
To recover CDATA, use /Obj/Name/text()
One suggestion would be to add another field holding the MD5 hash of the CDATA content. You can then use XPath to query on the MD5 value without any issue:
<sites>
<site>
<name>Google</name>
<url><![CDATA[http://www.google.com]]></url>
<urlMD5>ed646a3334ca891fd3467db131372140</urlMD5>
</site>
</sites>
Then you can search:
/sites/site[urlMD5='ed646a3334ca891fd3467db131372140']
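A rough Python sketch of the hash-field idea follows. It is illustrative only: the hash is computed in the code rather than copied from the sample above, the helper names are invented, and it relies on the limited predicate support in the standard library's xml.etree.ElementTree.

```python
import hashlib
import xml.etree.ElementTree as ET

URL = "http://www.google.com"
# Compute the digest here instead of hard-coding one.
URL_MD5 = hashlib.md5(URL.encode("utf-8")).hexdigest()

XML = f"""<sites>
  <site>
    <name>Google</name>
    <url><![CDATA[{URL}]]></url>
    <urlMD5>{URL_MD5}</urlMD5>
  </site>
</sites>"""

def find_site_by_hash(root, digest):
    """Find the <site> whose <urlMD5> child equals the digest."""
    # The value must be quoted so it is treated as a string literal.
    return root.find(f"site[urlMD5='{digest}']")

def find_site_by_url(root, url):
    """Find the <site> by comparing against the CDATA text directly."""
    return root.find(f"site[url='{url}']")

if __name__ == "__main__":
    root = ET.fromstring(XML)
    print(find_site_by_hash(root, URL_MD5).findtext("name"))
    print(find_site_by_url(root, URL).findtext("name"))
```

In this sketch the direct comparison against the url element also works, since the CDATA content is plain text to the parser; the hash field is mainly a convenience when the stored value is awkward to quote inside an expression.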
Related
I am new to xpath so I apologize in advance for how basic this question is.
How do I extract just the text from a specific element? For example, how would I extract just "text"
<h1>text</h1>
I tried the following but it seems to select everything including the tags instead of just the text.
//h1/text()
Thanks for your help
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File("src/myFile.xml"));
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
String sessionId = (String) xpath.evaluate(
        "/Envelope/Body/LoginProcessResponse/loginResponse/sessionId",
        doc, XPathConstants.STRING);
Here Envelope is my parent element, and I just traversed to the required path (in my case it is sessionId).
Hope it helps
This answer is rather an XSLT answer than an XPath answer, but many of the concepts are nevertheless applicable.
The XPath expression
//h1/text()
seems to be correct. It selects all text() nodes that are direct children of <h1> elements.
But one problem may be that the XSL default template still copies all the other text() nodes, as described here in the W3C specification:
In the absence of a select attribute, the xsl:apply-templates instruction processes all of the children of the current node, including text nodes.
So to solve your problem, you have to define an explicit template that
ignores all other text() nodes like this:
<xsl:template match="text()" />
If you add this line to your XSL processing, the result will most likely be more pleasant to you.
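For comparison outside XSLT, here is a minimal Python sketch of what //h1/text() is meant to select. It assumes the standard library's xml.etree.ElementTree, which does not support the text() step, so the element's .text attribute is read instead; the sample document and the function name h1_texts are made up.

```python
import xml.etree.ElementTree as ET

DOC = "<html><body><h1>text</h1><p>other</p></body></html>"

def h1_texts(markup):
    """Return the leading text of every <h1>, without tags or other nodes."""
    root = ET.fromstring(markup)
    # .text holds the character data before the element's first child.
    return [h1.text for h1 in root.iter("h1")]

if __name__ == "__main__":
    print(h1_texts(DOC))
```

Only the text inside h1 is returned here; the other text in the document shows up only if something else, such as a default template, emits it.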
I'm using Html Agility Pack to perform basic web scraping of Google search results. As a newbie to XPath, I made sure my path expression is correct (with the help of FirePath). However, the returned HtmlNodeCollection is always null.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = web.Load("http://www.google.com/search?num=10&q=Hello+World");
// get search result URLs
var items = htmlDoc.DocumentNode.SelectNodes("//div[@id='ires']/ol[@id='rso']/li/div[@class='vsc']/h3/a/@href");
foreach (HtmlNode node in items)
{
Console.WriteLine(node.Attributes);
}
Am I missing something? Can anyone please enlighten me?
Thanks in advance,
HAP can only process the raw HTML that is returned from the URL; it will not run any additional JavaScript that is on the page. You need to adjust your query accordingly.
In the raw HTML, the ires div exists but the rso element doesn't get inserted until the JavaScript is run, hence you get no results. There are other transformations done here which you'll have to adjust for as well.
Here's a fragment of the HTML:
<div id="ires">
<ol>
<li class="g">
<h3 class="r">
...
A more appropriate xpath to use for this would be:
var xpath = "//li[contains(concat(' ',@class,' '),' g ')]" +
"/h3[contains(concat(' ',@class,' '),' r ')]" +
"/a/@href";
It'd be easier to find all li with the g class as those correspond to all the results. You'll want to filter all h3 with the r class otherwise you'd include other results (such as image results).
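The contains(concat(' ', @class, ' '), ' g ') idiom matches a whole class token rather than a substring. A minimal Python sketch of the same test is shown below; the fragment, the example.com URLs, and the helper names are all invented for illustration, and the markup is assumed to be well-formed so that xml.etree.ElementTree can parse it.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div id="ires">
  <ol>
    <li class="g"><h3 class="r"><a href="http://example.com/a">A</a></h3></li>
    <li class="g extra"><h3 class="r"><a href="http://example.com/b">B</a></h3></li>
    <li class="gx"><h3 class="r"><a href="http://example.com/c">C</a></h3></li>
  </ol>
</div>"""

def has_class(element, token):
    """Mirror contains(concat(' ', @class, ' '), ' token ')."""
    padded = " " + element.get("class", "") + " "
    return (" " + token + " ") in padded

def result_links(root):
    """Collect href values from li.g > h3.r > a."""
    links = []
    for li in root.iter("li"):
        if not has_class(li, "g"):
            continue
        for h3 in li.findall("h3"):
            if has_class(h3, "r"):
                links.extend(a.get("href") for a in h3.findall("a"))
    return links

if __name__ == "__main__":
    print(result_links(ET.fromstring(FRAGMENT)))
```

Padding both sides with spaces is what keeps a class such as "gx" from matching "g", while still accepting "g extra".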
How can I retrieve all the valid XPaths for all these nodes?
----------------Sample XML---------------------
<name version="1.0">
<document>
<documentId>0107</documentId>
<NameDetail>
<firstname>SAM</firstname>
<internalreferenceNumber>12345</internalreferenceNumber>
</NameDetail>
<NameDetail>
<firstname>TECHNO</firstname>
<internalreferenceNumber>12346</internalreferenceNumber>
</NameDetail>
</document>
</name>
For the Above XML, the Output would be :
XPATH for name = "/name"
XPATH for documentId = "/document/documentId"
XPATH for firstname = "/document/NameDetail[1]/firstname"
XPATH for firstname = "/document/NameDetail[2]/firstname"
QTP does not support extracting XPaths from XML documents. You would have to do so yourself in plain VBScript, perhaps by using Microsoft's XMLDOM object.
Set xmlDoc = CreateObject( "Microsoft.XMLDOM" )
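The tree walk itself is short in any DOM-style API. As a language-neutral illustration (not QTP or VBScript), here is a minimal Python sketch that lists an indexed path for every element. The function name element_paths is hypothetical, and the paths it produces start at the root element, so they are /name/document/... rather than the /document/... form shown in the question.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XML = """<name version="1.0">
  <document>
    <documentId>0107</documentId>
    <NameDetail><firstname>SAM</firstname></NameDetail>
    <NameDetail><firstname>TECHNO</firstname></NameDetail>
  </document>
</name>"""

def element_paths(root):
    """Return one path per element, indexing repeated sibling names."""
    paths = []

    def walk(elem, path):
        paths.append(path)
        totals = Counter(child.tag for child in elem)
        seen = Counter()
        for child in elem:
            seen[child.tag] += 1
            step = child.tag
            # Add a 1-based position only when the name repeats among siblings.
            if totals[child.tag] > 1:
                step += "[%d]" % seen[child.tag]
            walk(child, path + "/" + step)

    walk(root, "/" + root.tag)
    return paths

if __name__ == "__main__":
    for p in element_paths(ET.fromstring(XML)):
        print(p)
```

The same recursion can be written against XMLDOM by iterating childNodes and counting siblings that share a nodeName.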
I'm trying to parse an xml file
My code looks like:
string path2 = "xmlFile.xml";
XmlDocument xDoc = new XmlDocument();
xDoc.Load(path2);
XmlNodeList xnList = xDoc.DocumentElement["feed"].SelectNodes("entry");
But can't seem to get the listing of nodes. I get the error message- "Use the 'new' keyword to create an object instance." and it seems to be on 'SelectNodes("entry")'
This code worked when I loaded the xml from an rss feed, but not a local file. Can you tell me what I'm doing wrong?
My xml looks like:
<?xml version="1.0"?>
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<entry gr:crawl-timestamp-msec="1318667375230">
<title type="html">Title 1 text</title>
<summary>summary 1 text text text</summary>
</entry>
<entry gr:crawl-timestamp-msec="1318667375230">
<title type="html">title 2 text</title>
<summary>summary 2 text text text</summary>
</entry>
</feed>
Take the namespace into account:
XmlNamespaceManager mgr = new XmlNamespaceManager(xDoc.NameTable);
mgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
XmlNodeList xnList = xDoc.SelectNodes("//atom:entry", mgr);
This is the most frequently asked question about XPath: referring to the names of elements that are in a default namespace.
Short answer: search for "XPath default namespace" and understand the problem.
Then use an XmlNamespaceManager instance to add an association between a prefix (say "x") and the default namespace (in your case "http://www.w3.org/2005/Atom").
Finally, replace any Name with x:Name in your XPath expression.
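The prefix-to-namespace mapping works the same way in other XPath-capable libraries. A minimal Python sketch with the standard library's xml.etree.ElementTree is shown below; the prefix "atom" is an arbitrary choice, the feed is a trimmed version of the one in the question, and the function names are invented.

```python
import xml.etree.ElementTree as ET

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title type="html">Title 1 text</title></entry>
  <entry><title type="html">title 2 text</title></entry>
</feed>"""

# Any prefix works, as long as it maps to the document's default namespace URI.
NS = {"atom": "http://www.w3.org/2005/Atom"}

def entry_titles(feed_xml):
    """Return the titles of all entries, using prefixed names."""
    root = ET.fromstring(feed_xml)
    return [entry.findtext("atom:title", namespaces=NS)
            for entry in root.findall("atom:entry", NS)]

def unprefixed_entries(feed_xml):
    """Show that an unprefixed name does not match namespaced elements."""
    return ET.fromstring(feed_xml).findall("entry")

if __name__ == "__main__":
    print(entry_titles(FEED))
    print(unprefixed_entries(FEED))
```

The unprefixed query comes back empty because "entry" with no namespace and "entry" in the Atom namespace are different names to the XPath engine.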
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>
inner hmtl 1
</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>
inner html 2
</div>
I would like to parse the inner html between the tags in such a way that I can
* associate the inner html 1 with header 1 and date 1
* associate the inner html 2 with header 2 and date 2
In other words, at the time I parse the inner html 1 I would like to know that the html nodes containing "Date 1" and "Header 1" have been parsed (but the nodes containing "Date 2" and "Header 2" have not been parsed)
If I were doing this via regular text parsing, I would read one line at a time and record the last "Date" and "Header" than I had parsed. Then when it came time to parse the inner html 1, I could refer to the last parsed "Date" and "Header" object to associate them together.
Using the Html Agility Pack, you can leverage the power of XPath and forget about that verbose xlinq crap :-). The XPath position() function is context sensitive. Here is some sample code:
HtmlDocument doc = new HtmlDocument();
doc.Load("your html file");
// select all DIV without a CLASS attribute defined
foreach (HtmlNode div in doc.DocumentNode.SelectNodes("//div[not(@class)]"))
{
Console.WriteLine("div=" + div.InnerText.Trim());
Console.WriteLine(" header=" + div.SelectSingleNode("preceding-sibling::div[position()=1]/b").InnerText);
Console.WriteLine(" date=" + div.SelectSingleNode("preceding-sibling::div[position()=2]/b").InnerText);
}
That will print this with your sample:
div=inner hmtl 1
header=Header 1
date=Date 1
div=inner html 2
header=Header 2
date=Date 2
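The "remember the last date and header" idea from the question can also be done in a single pass over the sibling divs. Here is a minimal Python sketch, assuming the markup is well-formed enough for xml.etree.ElementTree and wrapped in a made-up body element; the function name group_sections is hypothetical.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>inner html 1</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>inner html 2</div>
</body>"""

def group_sections(markup):
    """Pair each class-less div with the most recent date and header."""
    root = ET.fromstring(markup)
    results = []
    date = header = None
    for div in root.findall("div"):  # direct children, in document order
        css = div.get("class")
        if css == "mvb":
            date = div.findtext("b")
        elif css == "mxb":
            header = div.findtext("b")
        elif css is None:
            results.append((date, header, (div.text or "").strip()))
    return results

if __name__ == "__main__":
    for section in group_sections(HTML):
        print(section)
```

This trades the preceding-sibling lookups for a little state, which may be easier to follow when the layout is strictly sequential.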
Well, you can do this in several ways...
For example, if the HTML you want to parse is the one you wrote in your question, an easy way could be:
Store all dates in a HtmlNodeCollection
Store all headers in a HtmlNodeCollection
Store all inner texts in another HtmlNodeCollection
If everything is okay and the HTML has that layout, you will have the same number of elements in all 3 collections.
Then you can easily do:
for (int i = 0; i < innerTexts.Count; i++) {
//Get Date, Headers and Inner Texts at position i
}
The following should work:
var document = new HtmlWeb().Load("http://www.url.com"); //Or load it from a Stream, local file, etc.
var dateNodes = document.DocumentNode.SelectNodes("//div[@class='mvb']/b");
var headerNodes = document.DocumentNode.SelectNodes("//div[@class='mxb']/b");
var innerTextNodes = (from node in document.DocumentNode.SelectNodes("//div")
let previous = node.PreviousSibling
where previous.Name == "div" && previous.GetAttributeValue("class", "") == "mxb"
select node).ToList();
//Check here if the number of elements of the 3 collections are the same
for (int i = 0; i < dateNodes.Count; i++) {
var date = dateNodes[i].InnerText;
var header = headerNodes[i].InnerText;
var innerText = innerTextNodes[i].InnerText;
//Now you have the set you want: You have the Date, Header and Inner Text
}
This is a way of doing this.
Of course, you should check for exceptions (that the .SelectNodes(..) calls are not returning null), check for errors in the LINQ expression when storing innerTextNodes, and refactor the for (...), maybe into a method that receives an HtmlNode and returns its InnerText property.
Take into account that the only way you can know, in the HTML code you posted, which <div> tag contains the inner text is to assume it is the one right after the <div> tag that contains the header. That's why I used the LINQ expression.
Another way of knowing it could be if the <div> has some particular attribute (like class="___") or similar, or if it contains some tags inside it and not just text. There is no magic when parsing HTMLs :)
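The three-parallel-collections approach, including the length check mentioned above, can be sketched in a few lines of Python. This is an untested-against-HAP illustration that assumes well-formed markup parsed with xml.etree.ElementTree; the wrapper body element and the function name zip_sections are made up.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
<div class="mvb"><b>Date 1</b></div>
<div class="mxb"><b>Header 1</b></div>
<div>inner html 1</div>
<div class="mvb"><b>Date 2</b></div>
<div class="mxb"><b>Header 2</b></div>
<div>inner html 2</div>
</body>"""

def zip_sections(markup):
    """Select dates, headers and bodies separately, then pair them by index."""
    root = ET.fromstring(markup)
    dates = [b.text for b in root.findall("div[@class='mvb']/b")]
    headers = [b.text for b in root.findall("div[@class='mxb']/b")]
    bodies = [(d.text or "").strip()
              for d in root.findall("div") if d.get("class") is None]
    # Pairing by position is only safe when the three lists line up.
    if not (len(dates) == len(headers) == len(bodies)):
        raise ValueError("the three collections differ in length")
    return list(zip(dates, headers, bodies))

if __name__ == "__main__":
    for section in zip_sections(HTML):
        print(section)
```

Failing loudly on a length mismatch avoids silently pairing a body with the wrong date when one block in the page is missing its header.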
Edit:
I have not tested this code. Test it by yourself and let me know if it worked.