Any way to strip namespace garbage from XML file?

I need to select some nodes from an XML file (AppNamespace.xaml from a Silverlight XAP file, not that it matters), but the file has namespace stuff so XPath doesn't work. I could waste most of a day trial-and-erroring the bondage-and-discipline nightmare of XmlNamespaceManager and end up with hopelessly fragile code that can't tolerate the slightest variation in the input file (not a great idea in production code), or I could use the ludicrous local-name() syntax[1].
But it would be more convenient to use XPath as a human-readable query language that can be used to return specified nodes or attribute values from arbitrary XML files.
So is there any way to strip the line-noise out of the file? Or am I stuck? Is the labyrinthine imbecility of Linq-to-XML truly the lesser evil?
[1]
//*[local-name() = 'Deployment']/*[local-name() = 'Deployment.Parts']/*[local-name() = 'AssemblyPart']/@*[local-name()='Name']
Update
Five years down the road, I stand behind the term "labyrinthine imbecility" with every fiber of my being, except for a few fibers that want to use something much stronger.

Ed, here's an example of using namespaces with the System.Xml.XPath.Extensions class. I've modified it to match the input you're looking at:
using System;
using System.Collections;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

// The "..." in the root tag stands for attributes elided here; remove it before running.
string markup = @"
<Deployment xmlns=""http://schemas.microsoft.com/client/2007/deployment""
xmlns:x=""http://schemas.microsoft.com/winfx/2006/xaml"" ...>
<Deployment.Parts>
<AssemblyPart x:Name=""xamlName"" Source=""assembly"" />
</Deployment.Parts>
</Deployment>
";
XmlReader reader = XmlReader.Create(new StringReader(markup));
XElement root = XElement.Load(reader);
XmlNameTable nameTable = reader.NameTable;
XmlNamespaceManager nsm = new XmlNamespaceManager(nameTable);
nsm.AddNamespace("x", "http://schemas.microsoft.com/winfx/2006/xaml");
nsm.AddNamespace("dep", "http://schemas.microsoft.com/client/2007/deployment");
// The path ends in an attribute, so evaluate it with XPathEvaluate;
// XPathSelectElements only accepts expressions that yield elements.
var names = ((IEnumerable)root.XPathEvaluate(
    "//dep:Deployment/dep:Deployment.Parts/dep:AssemblyPart/@x:Name", nsm)).Cast<XAttribute>();
foreach (XAttribute name in names)
    Console.WriteLine(name.Value);
Not very complicated. Obviously you already know about XmlNamespaceManager, but I think you got a worse impression of it than it deserves.
When you say "hopelessly fragile code that can't tolerate the slightest variation in the input file", are you blaming namespaces in general, or XmlNamespaceManager? I don't see how either one makes it fragile... any more so than XML processing code without namespaces will not tolerate certain changes in the input document, but will tolerate others.
Have a little respect for other intelligent people in the industry, take a little time to understand the advantages behind a design before you dismiss it, and you will usually find that there are good reasons for what was done.
Not that XML namespaces couldn't be improved upon. However nobody has managed to produce a better standard and get it accepted by the community.

In XPath 2.0 you can use namespace wildcards (if you know what you are doing):
//*:Deployment/*:Deployment.Parts/*:AssemblyPart/@Name
By the way, if an attribute doesn't have a prefix, it is in no namespace at all. Since this is most often the case, you usually don't need local-name() for the attribute.
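For anyone without an XPath 2.0 engine at hand, here is a rough sketch of the same two ideas in Python's standard library (an assumption on my part, since the question is about .NET). ElementTree (Python 3.8 or later) accepts {*} as an "any namespace" wildcard, which plays a similar role to *: above, and it shows that an unprefixed attribute is looked up by its bare name while a prefixed one needs its namespace URI. The markup is the sample from the answer above with the elided attributes left out.

```python
import xml.etree.ElementTree as ET

# Sample markup from the thread, without the elided "..." attributes.
MARKUP = """<Deployment xmlns="http://schemas.microsoft.com/client/2007/deployment"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
  <Deployment.Parts>
    <AssemblyPart x:Name="xamlName" Source="assembly" />
  </Deployment.Parts>
</Deployment>"""

root = ET.fromstring(MARKUP)

# {*} matches the element in any namespace (or none), similar in spirit to *: in XPath 2.0.
part = root.find("{*}Deployment.Parts/{*}AssemblyPart")

# Source has no prefix, so it is in no namespace: the bare name is enough.
source = part.get("Source")

# x:Name is prefixed, so it lives in the XAML namespace and needs the full URI.
name = part.get("{http://schemas.microsoft.com/winfx/2006/xaml}Name")

print(source, name)
```

Note that asking for part.get("Name") without the URI finds nothing, which is the flip side of the rule about unprefixed attributes.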

I came here as a result of this search:
and I am adding an "Answer" to cheer on your "5 years on" update.
I was motivated to do this because I have an XML document that uses a tonne of namespaces -
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="urn:schemas-microsoft-com:office:excel2" version="1.0" exclude-result-prefixes="msxsl">
and APPARENTLY I have to know what all those namespaces are in advance in order to hard code the XmlNamespaceManager, or write some code that parses the namespace declarations and adds the relevant name spaces myself. Why in the name of all that is holy does the XmlDocument not manage to do that all by itself?
XmlDocument databaseXml = new XmlDocument();
databaseXml.LoadXml(xslt.XslTransform);
var dbnsmgr = new XmlNamespaceManager(databaseXml.NameTable);
dbnsmgr.AddNamespace("xsl", "http://www.w3.org/1999/XSL/Transform");
dbnsmgr.AddNamespace("ss", "urn:schemas-microsoft-com:office:spreadsheet");
XmlElement databaseStylesElement = (XmlElement)databaseXml.DocumentElement.SelectSingleNode("/xsl:stylesheet/xsl:template", dbnsmgr);
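The "write some code that parses the namespace declarations" option can be fairly short. Below is a minimal sketch in Python rather than C# (my choice, only because its standard library exposes declarations as parse events); the stylesheet is a cut-down version of the one above, and the helper name collect_namespaces and the "default" key are made up for illustration. Keep in mind that this ties your queries to whatever prefixes the document happens to use, which is exactly the fragility the other answer warns about.

```python
import io
import xml.etree.ElementTree as ET

def collect_namespaces(xml_text):
    """Return {prefix: uri} for every namespace declaration in the document."""
    namespaces = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # An empty prefix is the default namespace; file it under a made-up key.
        # setdefault keeps the first declaration if a prefix is declared twice.
        namespaces.setdefault(prefix or "default", uri)
    return namespaces

# Cut-down version of the stylesheet from the post.
SAMPLE = """<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="urn:schemas-microsoft-com:office:spreadsheet"
    xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" version="1.0">
  <xsl:template match="/"/>
</xsl:stylesheet>"""

namespaces = collect_namespaces(SAMPLE)
root = ET.fromstring(SAMPLE)

# The collected map can be passed straight to find/findall.
template = root.find("xsl:template", namespaces)
print(namespaces["xsl"], template.get("match"))
```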

Related

Nokogiri not parsing XML in ruby - xmlns issue?

Given the following ruby code :
require 'nokogiri'
xml = "<?xml version='1.0' encoding='UTF-8'?>
<ProgramList xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xmlns:xsd='http://www.w3.org/2001/XMLSchema' xmlns='http://publisher.webservices.affili.net/'>
<TotalRecords>145</TotalRecords>
<Programs>
<ProgramSummary>
<ProgramID>6540</ProgramID>
<Title>Matalan</Title>
<Limitations>A bit of text
</Limitations>
<URL>http://www.matalan.co.uk</URL>
<ScreenshotURL>http://www.matalan.co.uk/</ScreenshotURL>
<LaunchDate>2009-11-02T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
<ProgramSummary>
<ProgramID>11787</ProgramID>
<Title>Club 18-30</Title>
<Limitations/>
<URL>http://www.club18-30.com/</URL>
<ScreenshotURL>http://www.club18-30.com</ScreenshotURL>
<LaunchDate>2013-05-16T00:00:00</LaunchDate>
<Status>1</Status>
</ProgramSummary>
</Programs>
</ProgramList>"
doc = Nokogiri::XML(xml)
p doc.xpath("//Programs")
gives :
=> []
Not what is expected.
On further investigation if I remove xmlns='http://publisher.webservices.affili.net/' from the initial <ProgramList> tag I get the expected output.
Indeed if I change xmlns='http://publisher.webservices.affili.net/' to xmlns:anything='http://publisher.webservices.affili.net/' I get the expected output.
So my question is what is going on here? Is this malformed XML? And what is the best strategy for dealing with it?
While it's hardcoded in this example the XML is (will be) coming from a web service.
Update
I realise I can use the remove_namespaces! method but the Nokogiri docs do say that it's "...probably is not a good thing in general" to do this. Also I'm interested in why it's happening and what the 'correct' XML should be.

The xmlns='http://publisher.webservices.affili.net/' indicates the default namespace for all elements under the one where it appears (including the element itself). That means that all elements that don’t otherwise have an explicit namespace fall under this namespace.
XPath queries don’t have default namespaces (at least in XPath 1.0), so any name that appears in one without a prefix refers to that element in no namespace.
In your code, you want to find Programs elements in the http://publisher.webservices.affili.net/ namespace (since that is the default namespace), but your XPath query is looking for Programs elements in no namespace.
To explicitly specify the namespace in the query, you can do something like this:
doc.xpath("//pub:Programs", "pub" => "http://publisher.webservices.affili.net/")
Nokogiri makes this a little easier for namespaces declared on the root element (as in this case), declaring them for you with the same prefix. It will also declare the default namespace using the xmlns prefix, so you can also do:
doc.xpath("//xmlns:Programs")
which will give you the same result.
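The same behaviour is easy to reproduce outside Nokogiri. Here is a small sketch using Python's xml.etree.ElementTree (an assumption; the question uses Ruby) on a shortened copy of the XML from the question: the unprefixed query finds nothing, and the query with a registered prefix finds the element. The prefix "pub" is arbitrary; only the URI has to match.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question.
XML = """<ProgramList xmlns='http://publisher.webservices.affili.net/'>
  <TotalRecords>145</TotalRecords>
  <Programs>
    <ProgramSummary><ProgramID>6540</ProgramID><Title>Matalan</Title></ProgramSummary>
    <ProgramSummary><ProgramID>11787</ProgramID><Title>Club 18-30</Title></ProgramSummary>
  </Programs>
</ProgramList>"""

root = ET.fromstring(XML)

# Any prefix works here as long as it maps to the document's namespace URI.
NS = {"pub": "http://publisher.webservices.affili.net/"}

# No prefix means "no namespace", so nothing matches.
unprefixed = root.findall(".//Programs")

# With the prefix bound to the right URI the element is found.
prefixed = root.findall(".//pub:Programs", NS)
titles = [title.text for title in root.findall(".//pub:Title", NS)]

print(len(unprefixed), len(prefixed), titles)
```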

Parsing huge (~100mb) kml (xml) file taking hours without any sign of actual parsing

I'm currently trying to parse a very large kml (xml) file with ruby (Nokogiri) and am having a little bit of trouble.
The parsing code is good, in fact I'll share it just for the heck of it, even though this code doesn't have much to do with my problem:
geofactory = RGeo::Geographic.projected_factory(:projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs", :projection_srid => 3361)
f = File.open("horry_parcels.kml")
kmldoc = Nokogiri::XML(f)
kmldoc.css("//Placemark").each_with_index do |placemark, i|
puts i
tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
h = HorryParcel.new
h.owner_name = tds.shift.text
tds.shift
tds.each_slice(2) do |k, v|
col = k.text.downcase
eval("h.#{col} = v.text")
end
coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map {|x| x.split(",")}
points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
geo_shape = geofactory.polygon(geofactory.linear_ring(points))
proj_shape = geo_shape.projection
h.geo_shape = geo_shape
h.proj_shape = proj_shape
h.save
end
Anyway, I've tested this code with a much, much smaller sample of kml and it works.
However, when I load the real thing, ruby simply waits, as if it is processing something. This "processing", however, has now spanned several hours while I've been doing other things. As you might have noticed, I have a counter (each_with_index) on the array of Placemarks and during this multi-hour period, not a single i value has been put to the command line. Oddly enough it hasn't timed out yet, but even if this works there has got to be a better way to do this thing.
I know I could open up the KML file in Google Earth (Google Earth Pro here) and save the data in smaller, more manageable kml files, but the way things appear to be set up, this would be a very manual, unprofessional process.
Here's a sample of the kml (w/ just one placemark) if that helps.
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
<name>justone.kml</name>
<Style id="PolyStyle00">
<LabelStyle>
<color>00000000</color>
<scale>0</scale>
</LabelStyle>
<LineStyle>
<color>ff0000ff</color>
</LineStyle>
<PolyStyle>
<color>00f0f0f0</color>
</PolyStyle>
</Style>
<Folder>
<name>justone</name>
<open>1</open>
<Placemark id="ID_010161">
<name>STUART CHARLES A JR</name>
<Snippet maxLines="0"></Snippet>
<description>""</description>
<styleUrl>#PolyStyle00</styleUrl>
<MultiGeometry>
<Polygon>
<outerBoundaryIs>
<LinearRing>
<coordinates>
-78.941896,33.867893,0 -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0
</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</MultiGeometry>
</Placemark>
</Folder>
</Document>
</kml>
EDIT:
99.9% of the data I work with is in *.shp format, so I've just ignored this problem for the past week. But I'm going to get this process running on my desktop computer (off of my laptop) and run it until it either times out or finishes.
class ClassName
  attr_reader :before, :after

  def go
    @before = Time.now
    run_actual_code
    @after = Time.now
    puts "process took #{@after - @before} seconds to complete"
  end

  def run_actual_code
    ...
  end
end
The above code should tell me how long it took. From that (if it does actually finish) we should be able to compute a rough rule of thumb for how long you should expect your (otherwise PERFECT) code to run without SAX parsing or "atomization" of the document's text components.

For a huge XML file, you should not use Nokogiri's default XML parser, because it builds a DOM of the whole document in memory. A much better parsing strategy for large XML files is SAX. Luckily, Nokogiri supports SAX.
The downside is that with a SAX parser all the logic has to be done in callbacks. The idea is simple: the SAX parser starts reading the file and lets you know whenever it finds something interesting, for example an opening tag, a closing tag, or text. You bind callbacks to these events and extract whatever you need.
Of course, you don't want to use a SAX parser to load the whole file into memory and work with it there; that is exactly what SAX is meant to avoid. You will need to process the file part by part.
So this basically means rewriting your parsing logic with callbacks. To learn more about XML DOM vs SAX parsers, you might want to check this FAQ from cs.nmsu.edu
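To give a feel for the callback style, here is a minimal sketch using Python's built-in xml.sax module (an assumption; Nokogiri's SAX interface follows the same pattern but is not shown here). It collects the name of each Placemark without ever building a tree. The sample is a trimmed version of the KML in the question, and the second Placemark is invented for illustration.

```python
import xml.sax

class PlacemarkHandler(xml.sax.ContentHandler):
    """Collects the <name> of every <Placemark> without building a DOM."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_placemark = False
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "Placemark":
            self._in_placemark = True
        elif tag == "name" and self._in_placemark:
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name" and self._in_name:
            self.names.append("".join(self._buffer).strip())
            self._in_name = False
        elif tag == "Placemark":
            self._in_placemark = False

# Trimmed KML; the second Placemark is a made-up example.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document><name>justone.kml</name><Folder>
<Placemark id="ID_010161"><name>STUART CHARLES A JR</name></Placemark>
<Placemark id="ID_010162"><name>EXAMPLE OWNER</name></Placemark>
</Folder></Document></kml>"""

handler = PlacemarkHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.names)
```

For a real file you would call xml.sax.parse("horry_parcels.kml", handler) instead, and save each record inside endElement rather than keeping a list, so memory use stays flat.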

I actually ended up getting a copy of the data from a more accessible source, but I'm back here because I wanted to present a possible solution to the general problem: less. Less was built a long time ago and is part of Unix by default in most cases.
http://en.wikipedia.org/wiki/Less_%28Unix%29
Not related to the stylesheet language ("LESS"), less is a text viewer (cannot edit files, only read them) that does not load the entire document it is reading until you have scanned through the entire thing yourself. I.e., it loads the first "page", so to speak, and waits for you to call for the next one.
If a ruby script could somehow pipe "pages" of text into... oh wait... the XML structure wouldn't allow it, because a page wouldn't have the closing delimiters from the end of the undigested text file. So you would have to do some custom work on the front end: cut out the first couple of parent brackets so that you can pluck out the XML children one by one, and expect the last closing parent brackets to break the script, because the parser will think it is finished and then come across another closing bracket, I guess.
I haven't tried this and don't have anything to try it on. But if I did, I'd probably try piping n-line blocks of text into ruby (or python, etc.) via less or something similar to it. Perhaps something more primitive than less; I'm not sure.

how to replace a string in a ruby file after a match is found

I have an XML file which I need to modify from my Ruby script and save. The XML file looks something like this:
<mtn:messages>
<mtn:message correlation-key="0x" sequence="4">
<mtn:header>
<mtn:protocol-version>0x4</mtn:protocol-version>
<mtn:message-type>0x0F04</mtn:message-type>
<mtn:ttl>4</mtn:ttl>
<mtn:qos-class-of-service>0</mtn:qos-class-of-service>
<mtn:qos-priority>2</mtn:qos-priority>
</mtn:header>
</mtn:message>
</mtn:messages>
</mtn:test-case>
<mtn:test-case title="Train-Consist-Message">
<mtn:messages>
<mtn:message correlation-key="0x" sequence="4">
<mtn:header>
<mtn:protocol-version>0x4</mtn:protocol-version>
<mtn:message-type>0x0F04</mtn:message-type>
<mtn:ttl>4</mtn:ttl>
<mtn:qos-class-of-service>0</mtn:qos-class-of-service>
<mtn:qos-priority>2</mtn:qos-priority>
</mtn:header>
</mtn:message>
</mtn:messages>
</mtn:test-case>
I need to replace <mtn:ttl>4</mtn:ttl> with <mtn:ttl>some other value</mtn:ttl> which comes under <mtn:test-case title="Train-Consist-Message"> and save it.
I have written the code below, but it's replacing all occurrences of <mtn:ttl>4</mtn:ttl>.
doc = IO.read(ENV['CadPath1']+ "conf\\cad-mtn-config.xml")
doc.gsub!(pattern, str)
File.open("File path", "w"){|fh| fh.write(doc)}
Please help me with this. Waiting for your early reply...

String#gsub! modifies the string in place, replacing all instances with the replacement specified. If you only want to replace the first instance, use String#sub or String#sub!.

The suggestion from Mike about using sub instead of gsub is good. But parsing XML (and HTML) with regular expression is usually frowned upon.
From your question I assume that you locate the to-be-modified element in terms of parent-child relations, not in terms of source order (i.e. you will not be able to say "modify the second occurrence of this pattern"), so inventing a reliable regular expression may be very, very hard.
You should use a parser library to find the element you want to change. There is a pretty large collection of those. See some of them at http://ruby-toolbox.com/categories/html_parsing.html and pick one, or use a built-in REXML library.
Alternatively, you could use a very simple 'html-scanner' module, which is included in Rails' ActionController (action_controller/vendor/html-scanner.rb), but if you do not use Rails, I am not sure whether extracting it is worth your time.
The exact code will depend on the parser you choose. Usually they have pretty good documentation/tutorials, so I am sure you will be able to handle it.
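As an illustration of the parser approach, here is a minimal sketch in Python's xml.etree.ElementTree (an assumption; REXML or Nokogiri code would have the same shape). The namespace URI, the wrapper element mtn:test-cases and the first test case's title are all hypothetical, because the snippet in the question does not show them. The point is that the element is located through its parent's title attribute, so only the ttl under "Train-Consist-Message" changes.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the question never shows the xmlns:mtn declaration.
MTN = "urn:example:mtn"
ET.register_namespace("mtn", MTN)  # keep the mtn: prefix when serialising

# Hypothetical wrapper element and first title; the second test case is from the question.
SOURCE = f"""<mtn:test-cases xmlns:mtn="{MTN}">
  <mtn:test-case title="Some-Other-Message">
    <mtn:messages><mtn:message><mtn:header><mtn:ttl>4</mtn:ttl></mtn:header></mtn:message></mtn:messages>
  </mtn:test-case>
  <mtn:test-case title="Train-Consist-Message">
    <mtn:messages><mtn:message><mtn:header><mtn:ttl>4</mtn:ttl></mtn:header></mtn:message></mtn:messages>
  </mtn:test-case>
</mtn:test-cases>"""

def set_ttl(xml_text, title, new_ttl):
    """Set every mtn:ttl under the test case with the given title; return the new XML text."""
    root = ET.fromstring(xml_text)
    ns = {"mtn": MTN}
    for case in root.findall(f".//mtn:test-case[@title='{title}']", ns):
        for ttl in case.findall(".//mtn:ttl", ns):
            ttl.text = str(new_ttl)
    return ET.tostring(root, encoding="unicode")

result = set_ttl(SOURCE, "Train-Consist-Message", 9)
print(result)
```

Writing result back to the file then replaces the gsub! step from the question.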

Select default namespace in XPath with HtmlUnit

I want to parse a Feedburner feed with HtmlUnit.
The feed is this one: http://feeds.feedburner.com/alcoanewsreleases
From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.
groovy code snippet:
def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")
Sample of the XML feed:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
[...SNIP...]
<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&amp;pageID=20110518006002en">
<title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
<dc:date>2011-05-18</dc:date>
<link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
<description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
<feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&amp;pageID=20100104006194en</feedburner:origLink>
</item>
[...SNIP...]
</rdf:RDF>
I suspect this to be an issue with the namespaces because this document has 4 namespaces. The namespaces are
(this is the default) xmlns="http://purl.org/rss/1.0/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"
I have tried to use Nokogiri with this (another XML Parser that I use for ruby scripts).
With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed.
I have tried the same XPath with HtmlUnit but it does not work.
So I think I can phrase my question as:
How can I select a node from the default namespace with HtmlUnit?
Any ideas?

"From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case."
In XPath, that means "select all elements whose local name is item that are in no namespace". In RSS, the item elements must be in a namespace. So the above should never work with a conforming XML parser and XPath engine.
What's confusing is that in XML, <item> means "an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document;" whereas in XPath, "item" means an element in no namespace. (Or, you could say, it means an element in the default namespace, but unless you have a way to tell XPath what the default namespace is, the default namespace is no namespace. Usually (always?) in XPath 1.0 there is no way to declare the default namespace for XPath expressions.)
The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below...
"With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed."
Whatever that is, it's not XPath. Maybe it's a Nokogiri extension to it (a very convenient one, but its syntax is really counter-intuitive).
"So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?"
Let's phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the "rss" namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.
It's kind of like asking, "How do I write a function (in Javascript, C, Java, etc.) that can tell me the value of the variable a?" Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you'll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you'd think that compiler was nuts. It's not supposed to care about variable names as long as you use valid identifiers.
In the same way, when processing RSS, we're not supposed to care about what namespace prefix is used, as long as it's a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).
In XPath 2.0, you can wildcard the namespace. This is very handy if you know you're not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don't think HTMLUnit supports XPath 2.0. Also in XPath 2.0 environments like XSLT 2.0, you can specify a default namespace for XPath expressions, but that won't help you in HTMLUnit.
So you have a couple of choices:
Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].
or
The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HTMLUnit and pass it to the XPath processor? I took a quick look in the docs and didn't find any facility for doing that.
Caveat: I should add that the above is in regard to conforming XPath processors. I have no idea what XPath processor HTMLUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.
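To make the "robust way" concrete, here is a small sketch in Python's xml.etree.ElementTree (an assumption, since I could not find the equivalent facility in HTMLUnit). The same query, with the prefix rss bound to http://purl.org/rss/1.0/, works whether the feed declares that namespace as the default or under some other prefix. The second feed, with its "feed" prefix, is invented to show what the provider could legitimately send tomorrow.

```python
import xml.etree.ElementTree as ET

# The prefix "rss" is our own choice; only the URI is fixed by the RSS 1.0 spec.
RSS_NS = {"rss": "http://purl.org/rss/1.0/"}

# Cut-down version of the feed from the question (RSS is the default namespace).
DEFAULT_NS_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
  <item><title>Chris L. Ayers Named President, Alcoa Global Primary Products</title></item>
</rdf:RDF>"""

# Hypothetical variant: same data, but the provider uses an explicit prefix.
PREFIXED_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:feed="http://purl.org/rss/1.0/">
  <feed:item><feed:title>Chris L. Ayers Named President, Alcoa Global Primary Products</feed:title></feed:item>
</rdf:RDF>"""

def item_titles(xml_text):
    """Return the title of every RSS item, regardless of the prefix the document uses."""
    root = ET.fromstring(xml_text)
    return [item.findtext("rss:title", namespaces=RSS_NS)
            for item in root.findall(".//rss:item", RSS_NS)]

print(item_titles(DEFAULT_NS_FEED))
print(item_titles(PREFIXED_FEED))
```

Both calls give the same list, while a plain .//item query matches nothing in either feed.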
I saw here that someone used the following syntax for elements in the default namespace in HTMLUnit:
//:item
But I wouldn't recommend that, for three reasons:
It's not valid XPath, so you can't expect it to work with other programs.
It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.
It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don't adequately support namespaces.
HTMLUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing ways to declare namespace prefixes is a bug. HTMLUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don't know if HTMLUnit exposes that functionality.

This sounds familiar enough that I'm quite sure I've used namespaces and XPath successfully with HtmlUnit in the past, but of course I can't find the code. I suspect it must have been with HTML pages only: the page reference in your example is an XmlPage which has a number of methods specific to namespaces, all of which throw a "not implemented yet" exception when used. :-(
The current version (2.8) of HtmlUnit is nearly a year old, so it may be that some work has been done in the meantime to support XML namespaces. The "HtmlUnit Users" mailing list would be the place to find out.
In the meantime, as always there is a workaround:
final XmlPage page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases");
// no good
List elements = page.getByXPath("//item");
System.out.println( elements.size() ) ;
// ugly, but it works
DomElement de = (DomElement)page.getFirstByXPath( "//rdf:RDF" );
List<DomNode> items = new ArrayList<DomNode>() ;
for( DomNode dn : de.getChildNodes() )
{
String name = dn.getLocalName() ;
if( ( name != null ) && ( name.equals( "item" ) ) )
items.add( dn ) ;
}
System.out.println( "found " + items.size() ) ;
Oh boy Java is painful after working in Scala... ;-)

Accessing Comments in XML using XPath

How to access the comments inside the XML document using XPath?
For example:
<table>
<length> 12 </length>
<!--Some comment here-->
</table>
I want to access the "Some comment here".
Thanks...
EDIT: I am using MSXML DOM ActiveX and the command comment() seems to be failing... Any idea why?

With the path
/foo/bar/comment()
you can select all comments in the /foo/bar element. May depend on your language of choice, of course. But generally this is how you do it.

Use comment() function for example:-
/table/length/following::comment()[1]
selects the first comment that follows the length element.
Edit
Manoj asks in a comment to this answer why this isn't working in MSXML. The reason will be you are using MSXML3. By default MSXML3 does not use XPath as its selection language, it defaults to an earlier much weaker language (XSL pattern). You need to set XPath as the selection language via the DOMDocument's setProperty method. E.g (in JScript):-
var dom = new ActiveXObject("MSXML2.DOMDocument.3.0");
dom.setProperty("SelectionLanguage", "XPath");
Now the full XPath language will work in your queries (note one breaking change is indexer predicates are 1 based in XPath whereas they were 0 based in XSL Pattern).
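If you are not tied to MSXML, the same lookup can be sketched with a plain DOM walk. The example below uses Python's xml.dom.minidom (an assumption; it is not what the question uses), and since Python's standard library has no comment() node test, it walks the siblings after the length element instead of evaluating XPath. That makes it closer to following-sibling::comment() than to following::comment(). The helper name comments_after is made up.

```python
from xml.dom import minidom

def comments_after(xml_text, tag):
    """Return the text of comment nodes that follow the first <tag> element among its siblings."""
    doc = minidom.parseString(xml_text)
    element = doc.getElementsByTagName(tag)[0]
    found = []
    node = element.nextSibling
    while node is not None:
        # minidom keeps comments as nodes of type COMMENT_NODE.
        if node.nodeType == node.COMMENT_NODE:
            found.append(node.data)
        node = node.nextSibling
    return found

comments = comments_after(
    "<table><length> 12 </length><!--Some comment here--></table>", "length")
print(comments)
```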

Based on the OP's comments to posted answers (and my curiosity as to why this simple thing would not work), here is my suggestion:
Using the XPath expression suggested by @Anthony, I was able to successfully load the comment node with the following JS function:
function SelectComment(s)
{
var xDoc = new ActiveXObject("MSXML2.DOMDocument.6.0");
if (xDoc)
{
xDoc.loadXML(s);
var selNode = xDoc.selectSingleNode("/table/length/following::comment()[1]");
if (selNode != null)
return selNode.text;
else
return "";
}
}
Sample invocation:
SelectComment("<table><length> 12</length><!--Some comment here--></table>");
Output:
"Some comment here"
Notes:
a. Your MSXML version may vary. Please use appropriately.
b. This kind of code is definitely not recommended because it works only on IE. However, since this is your explicitly stated requirement, I have used the ActiveXObject.
c. You have not mentioned in your comments what fails in the suggested XPath expressions. My guess is that you are not querying the text property of the retrieved node. Keep in mind that selectSingleNode always returns an IXMLDOMNode, and you need to query its data or text properties.

Maybe this could help.
This sample removes comments:
XmlNodeList list = xmlDoc.SelectNodes("//comment()");
foreach(XmlNode node in list)
node.ParentNode.RemoveChild(node);

<adjustment>
<!-- krishna k -->
<name>FX Update USD</name>
<!-- Since this plan updates existing adj's no ajd's will be created using this id -->
<id>7206</id>
I am facing a similar issue: my application is reading comments, which causes a stack crash. How can I avoid reading comments with DOM?
