XmlNodeList SelectNodes trouble - xpath

I'm trying to parse an XML file.
My code looks like:
string path2 = "xmlFile.xml";
XmlDocument xDoc = new XmlDocument();
xDoc.Load(path2);
XmlNodeList xnList = xDoc.DocumentElement["feed"].SelectNodes("entry");
But I can't seem to get the list of nodes. I get the error message "Use the 'new' keyword to create an object instance." and it seems to be on 'SelectNodes("entry")'.
This code worked when I loaded the xml from an rss feed, but not a local file. Can you tell me what I'm doing wrong?
My xml looks like:
<?xml version="1.0"?>
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<entry gr:crawl-timestamp-msec="1318667375230">
<title type="html">Title 1 text</title>
<summary>summary 1 text text text</summary>
</entry>
<entry gr:crawl-timestamp-msec="1318667375230">
<title type="html">title 2 text</title>
<summary>summary 2 text text text</summary>
</entry>
</feed>

Take the namespace into account:
XmlNamespaceManager mgr = new XmlNamespaceManager(xDoc.NameTable);
mgr.AddNamespace("atom", "http://www.w3.org/2005/Atom");
XmlNodeList xnList = xDoc.SelectNodes("//atom:entry", mgr);

This is the most frequently asked question about XPath: how to refer to the names of elements that are in a default namespace.
In XPath 1.0 an unprefixed name always means "in no namespace", so "entry" never matches an element that inherits the Atom default namespace.
Short answer: search for "XPath default namespace" and understand the problem.
Then use an XmlNamespaceManager instance to associate a prefix (say "x") with the default namespace (in your case "http://www.w3.org/2005/Atom").
Finally, replace every Name with x:Name in your XPath expression.
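The same prefix-binding idea can be sketched outside .NET. This is a minimal illustration in Ruby using the bundled REXML library, not the asker's XmlDocument code; the trimmed feed and the prefix "x" are assumptions made for the example.

```ruby
require 'rexml/document'

# Trimmed version of the feed: every element inherits the default Atom namespace.
xml = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry><title>Title 1 text</title></entry>
    <entry><title>title 2 text</title></entry>
  </feed>
XML

doc = REXML::Document.new(xml)

# Bind an arbitrary prefix ("x") to the namespace URI, then prefix every name.
namespaces = { 'x' => 'http://www.w3.org/2005/Atom' }
titles = REXML::XPath.match(doc, '//x:entry/x:title', namespaces).map(&:text)

puts titles
```

The prefix is local to the query: it does not have to match anything in the document, only the namespace URI does.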

Related

Get attribute value from XML

I have this chunk of XML:
<show name="Are We There Yet?">
<sid>24588</sid>
<network>TBS</network>
<title>The Kwandanegaba Children's Fund Episode</title>
<ep>03x31</ep>
<link>
http://www.tvrage.com/shows/id-24588/episodes/1065228407
</link>
</show>
I am trying to get "Are we there yet?" via Nokogiri. It is effectively the 'name' attribute of 'show'. I'm struggling to figure out how to parse this.
xml.at_css('show').value was my best guess but doesn't work.
You can use the following:
xml.at('//show/@name').text
which is an XPath expression that returns the name attribute of the show element.
Use:
require 'nokogiri'
xml =<<EOT
<show name="Are We There Yet?">
<sid>24588</sid>
<network>TBS</network>
<title>The Kwandanegaba Children's Fund Episode</title>
<ep>03x31</ep>
<link>
http://www.tvrage.com/shows/id-24588/episodes/1065228407
</link>
</show>
EOT
xml = Nokogiri::XML(xml)
puts xml.at('show')['name']
=> Are We There Yet?
at accepts either CSS or XPath expressions, so feel free to use it for both. Use at_css or at_xpath if you need to state explicitly that the expression is CSS or XPath, respectively. at returns a Node, so you can read the node's attributes the way you would read values from a hash.
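For comparison, here is a small sketch of both styles of attribute access (hash-style lookup and the XPath attribute axis) using Ruby's bundled REXML instead of Nokogiri, so it runs without installing a gem. The shortened XML is an assumption for illustration.

```ruby
require 'rexml/document'

doc = REXML::Document.new('<show name="Are We There Yet?"><sid>24588</sid></show>')

# Hash-style lookup on the element's attributes.
name_from_hash = doc.root.attributes['name']

# XPath attribute axis: "@name" selects the attribute node itself.
name_from_xpath = REXML::XPath.first(doc, '//show/@name').value

puts name_from_hash
puts name_from_xpath
```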

Nokogiri (Ruby): Extract tag contents for a specific attribute inside each node

I have a XML with the following structure
<Root>
<Batch name="value">
<Document id="ID1">
<Tags>
<Tag id="ID11" name="name11">Contents</Tag>
<Tag id="ID12" name="name12">Contents</Tag>
</Tags>
</Document>
<Document id="ID2">
<Tags>
<Tag id="ID21" name="name21">Contents</Tag>
<Tag id="ID22" name="name22">Contents</Tag>
</Tags>
</Document>
</Batch>
</Root>
I want to extract the contents of specific tags for each Document node, using something like this:
xml.xpath('//Document/Tags').each do |node|
puts xml.xpath('//Root/Batch/Document/Tags/Tag[@id="ID11"]').text
end
This is expected to extract the contents of the tag with id = "ID11" for each of the 2 Document nodes, but it retrieves nothing. Any ideas?
You have a minor error in the XPath: you are using /Documents/Document, while the XML you pasted has no Documents element.
This should work:
//Root/Batch/Document/Tags/Tag[@id="ID11"]
My favorite way to do this is by using the #css method like this:
xml.css('Tag[id="ID11"]').each do |node|
puts node.text
end
It seems the XPath used was wrong.
'//Root/Batch/Documents/Document/Tags/Tag[@id="ID11"]'
should be
'//Root/Batch/Document/Tags/Tag[@id="ID11"]'
I managed to get it working with the following code:
xml.xpath('//Document/Tags').each do |node|
puts node.xpath("Tag[@id='ID11']").text
end
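The key point in the working version is that the inner XPath is relative to the current node rather than starting again from the document root. A minimal sketch of that pattern, written with Ruby's bundled REXML instead of Nokogiri and with slightly altered sample contents (both are assumptions for illustration):

```ruby
require 'rexml/document'

xml = <<~XML
  <Root>
    <Batch name="value">
      <Document id="ID1">
        <Tags>
          <Tag id="ID11" name="name11">Contents 11</Tag>
          <Tag id="ID12" name="name12">Contents 12</Tag>
        </Tags>
      </Document>
      <Document id="ID2">
        <Tags>
          <Tag id="ID21" name="name21">Contents 21</Tag>
        </Tags>
      </Document>
    </Batch>
  </Root>
XML

doc = REXML::Document.new(xml)

# For each Document, evaluate a path relative to that Document (no leading "//").
results = REXML::XPath.match(doc, '//Document').map do |document|
  tag = REXML::XPath.first(document, "Tags/Tag[@id='ID11']")
  [document.attributes['id'], tag && tag.text]
end

results.each { |id, text| puts "#{id}: #{text.inspect}" }
```

A Document without a matching Tag yields nil here, which makes it obvious which documents lacked the tag.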

XQuery return text node if it contains given keyword

A test sample of my xml file is shown below:
test.xml
<feed>
<entry>
<title>Link ISBN</title>
<libx:libapp xmlns:libx="http://libx.org/xml/libx2" />
</entry>
<entry>
<title>Link Something</title>
<libx:module xmlns:libx="http://libx.org/xml/libx2" />
</entry>
</feed>
Now, I want to write an xquery which will find all <entry> elements which have <libx:libapp> as a child. Then, for all such entries return the title if the title contains a given keyword (such as Link). So, in my example xml document the xquery should return "Link ISBN".
My sample xquery is shown below:
samplequery.xq (here doc_name is the xml file shown above and libapp_matchkey is a keyword such as 'Link')
declare namespace libx='http://libx.org/xml/libx2';
declare variable $doc_name as xs:string external;
declare variable $libpp_matchkey as xs:string external;
let $feeds_doc := doc($doc_name)
for $entry in $feeds_doc/feed/entry
(: test whether entry has libx:libapp child and has "Link" in its title child :)
where ($entry/libx:libapp and $entry/title/text()[contains(.,$libapp_matchkey)])
return $entry/title/text()
This xquery is returning null instead of the expected result 'Link ISBN'. Why is that?
I want to write an xquery which will find all <entry> elements which have <libx:libapp> as a child. Then, for all such entries return the title if the title contains a given keyword (such as Link).
Just use:
/*/entry[libx:libapp]/title[contains(.,'Link')]/text()
Wrapping this XPath expression in XQuery we get:
declare namespace libx='http://libx.org/xml/libx2';
/*/entry[libx:libapp]/title[contains(.,'Link')]/text()
when applied on the provided XML document:
<feed>
<entry>
<title>Link ISBN</title>
<libx:libapp xmlns:libx="http://libx.org/xml/libx2" />
</entry>
<entry>
<title>Link Something</title>
<libx:module xmlns:libx="http://libx.org/xml/libx2" />
</entry>
</feed>
the wanted, correct result is produced:
Link ISBN

How do I remove the <opt> tag in XML::Simple output?

I'm creating an XML file using Perl and the XML::Simple module. I can create the XML file successfully, but all of my tags end up wrapped in an <opt> </opt> tag. I am looking for an option that lets me avoid the <opt> </opt> tag. I can't post-process the file to remove the tag, because the file is huge.
Example :
<opt>
<person firstname="Joe" lastname="Smith">
<email>joe@smith.com</email>
<email>jsmith@yahoo.com</email>
</person>
<person firstname="Bob" lastname="Smith">
<email>bob@smith.com</email>
</person>
</opt>
and I am looking for (without <opt> tag):
<person firstname="Joe" lastname="Smith">
<email>joe@smith.com</email>
<email>jsmith@yahoo.com</email>
</person>
<person firstname="Bob" lastname="Smith">
<email>bob@smith.com</email>
</person>
The <opt> tag is the root element of the XML generated from the user-supplied data structure.
From the XML::Simple documentation -
RootName => 'string' # out - handy
By default, when XMLout() generates XML, the root element will be named 'opt'. This option allows you to specify an alternative name.
Specifying either undef or the empty string for the RootName option will produce XML with no root elements. In most cases the resulting XML fragment will not be 'well formed' and therefore could not be read back in by XMLin(). Nevertheless, the option has been found to be useful in certain circumstances.
To omit the root element, pass undef (the Perl value, not the string 'undef') as RootName to XMLout, e.g.
use XML::Simple;
my $xml = XMLout($hashref, RootName => undef);
I came across this answer when searching for the same info (read, parse, modify, and output xml, fix the <opt> root tag) but in Ruby.
FYI, the root node can also be removed or named in the Ruby version of the library:
require 'xmlsimple' # gem install xml-simple
data = XmlSimple.xml_in(filename) # read data from filename
# Parse data as needed, then output:
XmlSimple.xml_out(data, { 'RootName' => nil }) # Remove root element
XmlSimple.xml_out(data, { 'RootName' => 'html' }) # Change root <opt> to <html>
The above answer did not work for me. What you can do is:
my $xml = XML::Simple->new(KeepRoot=>0);
print $xml->XMLout($YourVariable);
That said, a valid XML document should have a root. If what you want to do is name your root node, you can do this:
print $xml->XMLout({'RootNodeName' => {'ChildNode'=>[@ArrayOfThings]}});

How do I retrieve element text inside CDATA markup via XPath?

Consider the following xml fragment:
<Obj>
<Name><![CDATA[SomeText]]></Name>
</Obj>
How do I retrieve the "SomeText" value via XPath? I'm using Nauman Leghari's (excellent) Visual XPath tool.
/Obj/Name returns the element
/Obj/Name/text() returns blank
I don't think it's a problem with the tool (I may be wrong) - I also read that XPath can't extract CDATA (see the last response in this thread) - which sounds kinda weird to me.
/Obj/Name/text() is the XPath to return the content of the CDATA markup.
What threw me off was the behavior of the Value property. For an XMLNode (DOM world), the XmlNode.Value property of an Element (with CDATA or otherwise) returns Null. The InnerText property would give you the CDATA/Text content.
If you use Xml.Linq, XElement.Value returns the CDATA content.
string sXml = @"
<object>
<name><![CDATA[SomeText]]></name>
<name>OtherName</name>
</object>";
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml( sXml );
XmlNamespaceManager nsMgr = new XmlNamespaceManager(xmlDoc.NameTable);
Console.WriteLine(@"XPath = /object/name" );
WriteNodesToConsole(xmlDoc.SelectNodes("/object/name", nsMgr));
Console.WriteLine(@"XPath = /object/name/text()" );
WriteNodesToConsole( xmlDoc.SelectNodes("/object/name/text()", nsMgr) );
Console.WriteLine(@"Xml.Linq = obRoot.Elements(""name"")");
XElement obRoot = XElement.Parse( sXml );
WriteNodesToConsole( obRoot.Elements("name") );
Output:
XPath = /object/name
NodeType = Element
Value = <null>
OuterXml = <name><![CDATA[SomeText]]></name>
InnerXml = <![CDATA[SomeText]]>
InnerText = SomeText
NodeType = Element
Value = <null>
OuterXml = <name>OtherName</name>
InnerXml = OtherName
InnerText = OtherName
XPath = /object/name/text()
NodeType = CDATA
Value = SomeText
OuterXml = <![CDATA[SomeText]]>
InnerXml =
InnerText = SomeText
NodeType = Text
Value = OtherName
OuterXml = OtherName
InnerXml =
InnerText = OtherName
Xml.Linq = obRoot.Elements("name")
Value = SomeText
Value = OtherName
Turned out the author of Visual XPath had a TODO for the CDATA type of XmlNodes. A little code snippet and I have CDATA support now.
MainForm.cs
private void Xml2Tree( TreeNode tNode, XmlNode xNode)
{
...
case XmlNodeType.CDATA:
//MessageBox.Show("TODO: XmlNodeType.CDATA");
// Gishu
TreeNode cdataNode = new TreeNode("![CDATA[" + xNode.Value + "]]");
cdataNode.ForeColor = Color.Blue;
cdataNode.NodeFont = new Font("Tahoma", 12);
tNode.Nodes.Add(cdataNode);
//Gishu
break;
CDATA sections are just part of what in XPath is known as a text node or in the XML Infoset as "chunks of character information items".
Obviously, your tool is wrong. Other tools, such as the XPath Visualizer, correctly highlight the text of the Name element when evaluating this XPath expression:
/*/Name/text()
One can also write a simple XSLT transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
"<xsl:value-of select="/*/Name"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<Obj>
<Name><![CDATA[SomeText]]></Name>
</Obj>
the correct result is produced:
"SomeText"
I think the thread you referenced says that the CDATA markup itself is ignored by XPath, not the text contained in the CDATA markup.
My guess is that it's an issue with the tool. The source code is available for download, so maybe you can debug it...
See if this helps - http://www.zrinity.com/xml/xpath/
XPATH = /Obj/Name/text()
Just in case you run into a similar issue with jdom2, text() will be an array.
To recover CDATA, use /Obj/Name/text()
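A quick way to check this behaviour independently of any GUI tool is a few lines of script. The sketch below uses Ruby's bundled REXML (an assumption; the thread itself uses .NET and Visual XPath) and shows that text() reaches the CDATA content and that the element's own text is the same string.

```ruby
require 'rexml/document'

doc = REXML::Document.new('<Obj><Name><![CDATA[SomeText]]></Name></Obj>')

# text() selects the text node child; the CDATA wrapper is only markup.
text_node = REXML::XPath.first(doc, '/Obj/Name/text()')
from_xpath = text_node.value

# Asking the element for its text gives the same string.
from_element = doc.root.elements['Name'].text

puts from_xpath
puts from_element
```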
A suggestion would be to have another field of the md5 hash of the cdata. You can then use xpath to query based off the md5 with no issue
<sites>
<site>
<name>Google</name>
<url><![CDATA[http://www.google.com]]></url>
<urlMD5>ed646a3334ca891fd3467db131372140</urlMD5>
</site>
</sites>
Then you can search:
/sites/site[urlMD5='ed646a3334ca891fd3467db131372140']
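Note that the hash has to appear as a quoted string literal in the predicate; an unquoted token would be read as an element name. As a rough sketch of the whole round trip, the Ruby snippet below computes the digest, stores it next to the CDATA field, and looks the site up by it. It uses the bundled Digest and REXML libraries and computes the hash itself rather than relying on the value quoted above; the variable names are illustrative assumptions.

```ruby
require 'digest'
require 'rexml/document'

url = 'http://www.google.com'
digest = Digest::MD5.hexdigest(url) # 32 lowercase hex characters

# Store the digest alongside the CDATA-wrapped URL.
xml = <<~XML
  <sites>
    <site>
      <name>Google</name>
      <url><![CDATA[#{url}]]></url>
      <urlMD5>#{digest}</urlMD5>
    </site>
  </sites>
XML

doc = REXML::Document.new(xml)

# The digest is quoted so XPath treats it as a string, not an element name.
site = REXML::XPath.first(doc, "/sites/site[urlMD5='#{digest}']")
site_name = site.elements['name'].text

puts site_name
```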

Resources