Can sitemap.xml processors cope with <!ENTITY name "my text">? - sitemap

Can sitemap.xml processors cope with this?
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY port ":8080">
<!ENTITY host "http://example.com&port;">
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>&host;/path/</loc>
<!-- ...

I assume so, although a processor will most likely just ignore the declarations. Since there is no Sitemaps DTD, I think a processor has to ignore them unless it specifically expects them.

From Wikipedia:
In the markup languages SGML, HTML, XHTML and XML, a character entity reference is a reference to a particular kind of named entity that has been predefined or explicitly declared in a Document Type Definition (DTD). The "replacement text" of the entity consists of a single character from the Universal Character Set/Unicode. The purpose of a character entity reference is to provide a way to refer to a character that is not universally encodable.
In short, no. Not unless the processor is very forgiving.
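For what it's worth, the XML specification only allows entity declarations inside a DOCTYPE declaration, so the snippet in the question is not well-formed as written. The following is a minimal sketch using Python's standard-library parser to illustrate the difference. The helper name first_loc and the DOCTYPE wrapper are assumptions added for illustration, and the result says nothing about how any particular search engine's sitemap reader behaves.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Entities declared at the top level, as in the question: not well-formed XML.
BARE_ENTITIES = b"""<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY port ":8080">
<!ENTITY host "http://example.com&port;">
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>&host;/path/</loc></url>
</urlset>"""

# The same declarations moved into an internal DTD subset (an assumption
# about what the question intended).
IN_DOCTYPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE urlset [
<!ENTITY port ":8080">
<!ENTITY host "http://example.com&port;">
]>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>&host;/path/</loc></url>
</urlset>"""


def first_loc(document):
    """Return the text of the first <loc>, or None if parsing fails."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return None
    loc = root.find(f"{NS}url/{NS}loc")
    return None if loc is None else loc.text


print(first_loc(BARE_ENTITIES))
print(first_loc(IN_DOCTYPE))
```

Under these assumptions the first document is rejected by the parser, while the second has its entity references expanded before the application sees the text. Whether a given sitemap consumer uses a parser that processes the internal DTD subset is a separate question.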

Related

XPath - How to get image source from xml

Hello, I have this XML:
<item>
<title> Something for title»</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>
I tried to get the img src with the path //description//div[@class='feed-description']//div[@class='feed-image']//img/@src but it doesn't work.
Is there any solution?
A CDATA section escapes its contents. In other words, CDATA prevents its contents from being parsed as markup when the rest of the document is parsed. So the <div>s in there are not seen as XML elements, only as flat text. The <description> element has no element children ... only a single text child. As such, XPath can't select any <div> descendant of <description> because none exists in the parsed XML tree.
What to do?
If your XPath environment supports XPath 3.0, you could use parse-xml() to turn the flat text into a tree, then use XPath to select //div[@class='feed-description']//div[@class='feed-image']//img/@src from the resulting tree.
Otherwise, your best workaround may be to use primitive string-processing functions like substring-before(), substring-after(), or matches(). (The latter uses regular expressions and requires XPath 2.0.) Of course, many people will tell you not to use regular expressions to analyze markup like XML and HTML. For good reason: in the general case, it's very difficult to do it right (with regexes or with plain string searches). But for very restricted cases where the input is highly predictable, and in the absence of better tools, it can be the best tool for a less-than-ideal job.
For example, for the data shown in your question, you could use
substring-before(substring-after(//description, 'img src="'), '"')
In this case, the inner call substring-after(//description, 'img src="') returns pictureUrl.jpg" /></div>text for desc</div>, of which the substring before " is pictureUrl.jpg.
This isn't really robust, for example it'll fail if there's a space between src and =. But if the exact formatting is predictable, you'll be OK.
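If XPath 3.0 is not available but a host language is, the same two-pass idea (parse the outer document, then parse the CDATA text as its own document) can be sketched as follows. This is only an illustration in Python's standard library; the function name image_source is hypothetical, and it assumes the embedded markup happens to be well-formed XML, as it is in the question's sample.

```python
import xml.etree.ElementTree as ET

ITEM = """<item>
<title>Something for title</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
</item>"""


def image_source(item_xml):
    """Return the src of the first feed image inside <description>."""
    item = ET.fromstring(item_xml)
    # The CDATA content arrives as plain text, not as child elements.
    embedded = item.findtext("description")
    # Second pass: parse that text as its own document.
    fragment = ET.fromstring(embedded)
    img = fragment.find(".//div[@class='feed-image']/img")
    return None if img is None else img.get("src")


print(image_source(ITEM))
```

If the embedded HTML were not well-formed (for example an unclosed <br>), the second parse would raise an error and an HTML parser would be the better tool.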

Need to understand - why CDATA section is treated as if the <![CDATA[ and ]]>?

I was reading a textbook to learn XPath, and I found the following passage in it:
How does XPath handle text in XML CDATA sections? Each character within a CDATA section is treated as character data. In other words, a CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of markup like < and & was replaced by the corresponding character entities like &lt; and &amp;.
But the book didn't give any examples to explain this. Can anyone help me understand what the author meant by the following?
a CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of markup like < and & was replaced by the corresponding character entities like &lt; and &amp;.
I think of it the other way round - everything between a <![CDATA[ and the next ]]> is treated as text, and not subject to the usual decoding of entity references, and < signs don't introduce element names. So
<something><![CDATA[<foo>text&more</foo>]]></something>
is the same as
<something>&lt;foo&gt;text&amp;more&lt;/foo&gt;</something>
whereas
<something><foo>text&more</foo></something>
is not well-formed XML (because the & is treated as the start of an entity reference but there's no corresponding ; to end it).
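The equivalence can be checked with any XML parser. Here is a small sketch using Python's standard library; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<something><![CDATA[<foo>text&more</foo>]]></something>"
)
with_entities = ET.fromstring(
    "<something>&lt;foo&gt;text&amp;more&lt;/foo&gt;</something>"
)

# Both forms yield one text node and no child elements.
print(with_cdata.text == with_entities.text)
print(len(with_cdata), len(with_entities))

# Without CDATA or escaping, the bare & makes the document ill-formed.
try:
    ET.fromstring("<something><foo>text&more</foo></something>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)
```

After parsing, the application (and therefore XPath) cannot tell which of the first two spellings was used; it only sees the text <foo>text&more</foo>.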

Debugging KML file

<?xml version="1.0" encoding="UTF-8"?> <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document> <Placemark> <Name>Test Name</Name> <Description><b>Project Information</b><br><ul><li>Project Name: Test Name</li><li>Project Number: Test Number</li><li>Project Location: Test Location</li><li>System: Test System</li></ul><br><b>Project Team</b><br><br><ul><li>Regional Manager: Mem 1</li><li>Project Manager: Mem 2</li></ul><br>YouTube Video URL: <a href="http://youtu.be/U9EYP9GIe2k"><br>Picassa Album URL: <a href="www.picassa.com"><br></Description> <Point> <Coordinates>30,-125</Coordinates>,0 </Point> </Placemark> </Document> </kml>
This is what my custom Excel macro is generating (I'm new to programming, so take it easy on me if you notice something big). When I attempt to open the KML file with Google Earth, I get the following message: Open of file "file path" failed: Parse error at line 2, column 454: mismatched tag. This correlates to the /Description tag... What is wrong with this tag? It matches up with its corresponding Description tag.
There are a handful of techniques you can apply to debug and repair a corrupt KML file.
Basically, the quickest way to validate a KML file is first using your web browser. KML is an XML file so first you can test if it's a well-formed XML file, which is a prerequisite to it being a valid KML file. Simply rename the KML file adding an .xml file extension then drag the file onto a web browser (Firefox, Chrome, etc.) to validate it. See detailed example here.
Once those errors are found and fixed, you can try a KML validator that checks whether the file is valid KML with respect to the OGC KML Specification and associated XML Schema, such as the standalone command-line XmlValidator tool.
In your example, if you run it through a simple XML SAX parser it shows: element type "br" must be terminated by the matching end-tag "</br>" at column 455.
The error is that the <Description> element contains HTML markup that isn't escaped with a CDATA block (CDATA is part of the XML standard), so the parser treats <br> as an XML element that is never closed. To fix this you need to reformat your KML like this:
<description>
<![CDATA[
<b>Project Information</b>
...
<br>
]]>
</description>
Also, the element has the wrong name (Description vs. description); KML is case-sensitive. The same applies to Name and Coordinates, which should be name and coordinates.
More tips to debug KML files can be found here.
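The well-formedness check can also be scripted instead of done in a browser. This is a minimal sketch with Python's standard library; the function name well_formedness_error and the shortened sample strings are assumptions for illustration, and it checks XML well-formedness only, not validity against the KML schema.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return (line, column, message) for the first parse error, else None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Unescaped HTML: <br> is never closed, so the parser rejects the document.
BROKEN = "<description><b>Project Information</b><br></description>"
# The same markup wrapped in CDATA is just text to the XML parser.
FIXED = "<description><![CDATA[<b>Project Information</b><br>]]></description>"

print(well_formedness_error(BROKEN))
print(well_formedness_error(FIXED))
```

The reported position points at the closing tag where the mismatch is detected, which is why the original error message blamed /Description rather than the unclosed <br>.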

Substituting text in a file with Ruby

I need to read in a file which will be in xml format but all crammed into a single line, and I need to parse that line to find a specific property and replace its value with something I have specified.
The file might contain:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><VerificationPoint type="Screenshot" version="2"><Description/><Verification object=":qP1B11_QLabel" type="PNG">
I need to search through this line, find the property "Verification object=" and replace the :qP1B11 with my own string. Please note that I don't want to replace the _QLabel" type="PNG"> part of the string if possible.
I can't use sub as I don't know the value of the property, which could be anything. I believe I should be able to do this with regular expressions, but I have never had to use them before and all the examples I've seen just make me more confused than before.
If anyone can present me with an elegant answer (and an explanation if using regexp) it would be a huge help!
Thanks
You have XML so use an XML parser. Nokogiri will make short work of that:
require 'nokogiri'

# that_string holds the single-line XML read from the file
doc = Nokogiri::XML(that_string)
# Rewrite the object attribute of every <Verification> element
doc.search('Verification').each do |node|
  node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
end
new_string = doc.to_xml
# "<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<VerificationPoint type="Screenshot" version="2">\n <Description/>\n <Verification object="PANCAKES_QLabel" type="PNG">\n</Verification>\n</VerificationPoint>\n"
You can adjust the output format using the options for to_xml.
If you only have one <Verification> then you could do it like this:
node = doc.at('Verification')
node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
new_string = doc.to_xml
In either case you'd adjust your regex and replacement to suit your needs.

html 4.0 entities in XPATH queries

I don't know exactly why the XPath expression:
//h3[text()='Foo &rsaquo; Bar']
doesn't match:
<h3>Foo &rsaquo; Bar</h3>
Does that seem right? How do I query for that markup?
XPath does not define any special escape sequences. When XPath is used within XSLT (e.g. in attributes of elements of an XSLT document), the escape sequences are processed by the XML processor that reads the stylesheet. If you use XPath in a non-XML context (e.g. from Java, C#, or another language) via a library, and your XPath query is a string literal in that language, you won't get any escape processing aside from that which the language itself usually does.
If this is C# or Java, this should work:
String xpath = "//h3[text()='Foo \u203A Bar']";
...
As a side note, it wouldn't work in XSLT either, as XSLT uses XML, which doesn't define a character entity &rsaquo; - it only defines &lt;, &gt;, &quot;, &apos; and &amp;. You'd have to either use &#8250;, or define the character entity yourself in the DOCTYPE declaration of the XSLT stylesheet.
From the XPath specification:
XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax
… so unless you are using the query inside (as opposed to "to query") a language that resolves that entity, I wouldn't expect it to work. XSLT with a DTD that includes the entity might be such a case, if that is possible; I'm far from an XSLT expert.
Use a literal character or an escape sequence recognized by whatever language you are using XPath from.
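As an illustration of that advice in another host language, here is a minimal Python sketch. It relies on the limited XPath subset in the standard library's ElementTree, which has no text() but supports the [.='...'] predicate (Python 3.7 and later); the sample document and variable names are made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Plain XML has no &rsaquo; entity, so the sample uses a numeric reference.
DOC = "<div><h3>Foo &#8250; Bar</h3><h3>Other</h3></div>"
root = ET.fromstring(DOC)

# Python resolves \u203a before the query string reaches the XPath engine.
query = ".//h3[.='Foo \u203a Bar']"
matches = root.findall(query)
print(len(matches))

# If the search text arrives with HTML entities, decode it first.
decoded = html.unescape("Foo &rsaquo; Bar")
print(decoded == "Foo \u203a Bar")
```

The point is the same as in the Java or C# case: the entity or escape is resolved by the host language, and the XPath engine only ever compares the actual character U+203A.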

Resources