<?xml version="1.0" encoding="UTF-8"?> <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document> <Placemark> <Name>Test Name</Name> <Description><b>Project Information</b><br><ul><li>Project Name: Test Name</li><li>Project Number: Test Number</li><li>Project Location: Test Location</li><li>System: Test System</li></ul><br><b>Project Team</b><br><br><ul><li>Regional Manager: Mem 1</li><li>Project Manager: Mem 2</li></ul><br>YouTube Video URL: <a href="http://youtu.be/U9EYP9GIe2k"><br>Picassa Album URL: <a href="www.picassa.com"><br></Description> <Point> <Coordinates>30,-125</Coordinates>,0 </Point> </Placemark> </Document> </kml>
This is what my custom Excel macro is generating (I'm new to programming, so take it easy on me if you notice something big). When I attempt to open the KML file with Google Earth, I get the following message: Open of file "file path" failed: Parse error at line 2, column 454: mismatched tag. This correlates to the /Description tag... What is wrong with this tag? It matches up with its corresponding Description tag.
There are a handful of techniques you can apply to debug and repair a corrupt KML file.
The quickest way to validate a KML file is with your web browser. KML is an XML format, so first you can test whether the file is well-formed XML, which is a prerequisite to its being valid KML. Simply rename the KML file with an .xml file extension, then drag it onto a web browser (Firefox, Chrome, etc.) to validate it. See a detailed example here.
Once those errors are found and fixed, you can run the file through a KML validator that checks whether it is valid KML with respect to the OGC KML Specification and its associated XML Schema, such as the standalone command-line XmlValidator tool.
In your example, if you run it through a simple XML SAX parser it shows: element type "br" must be terminated by the matching end-tag "</br>" at column 455.
The error is that the <description> element contains HTML markup but isn't escaped with a CDATA block (CDATA is part of the XML standard). To fix this, you need to reformat your KML like this:
<description>
<![CDATA[
<b>Project Information</b>
...
<br>
]]>
</description>
Also, the element has the wrong name (Description vs description). KML is case-sensitive.
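If you'd rather surface those parse errors programmatically, here is a minimal sketch using Ruby's Nokogiri gem (my own illustration; 'project.kml' is a placeholder path for your generated file):

require 'nokogiri'

# Parse the KML and list any well-formedness errors that libxml2 reports,
# such as the unterminated <br> tags and mismatched element names.
doc = Nokogiri::XML(File.read('project.kml'))
doc.errors.each { |err| puts "line #{err.line}: #{err.message}" }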
More tips to debug KML files can be found here.
Related
I'm a professional indexer who is new to Ruby and Nokogiri, and I'm in need of some assistance.
I'm working on a set of macros that will allow me to take an XML file, output from my indexing software, and parse it into valid \index{} commands for inclusion in a LaTeX source file. Each XML <record> contains at least two <field> tags, so I will have to iterate over the multiple <field> tags to build my \index{} entry.
The following is an example of an index record from the xml file.
<record time="2022-08-27T17:25:12" id="30">
<field><text style="i"/><hide>SS </hide>Titanic<text/></field>
<field>passengers</field>
<field class="locator"><text style="b"/>5<text/></field>
</record>
I will produce intermediate output of this record in the form of:
\index{Titanic#\textit{SS Titanic}!passengers|textbf} 5
(The numeric locator is used to place the \index{} entry at the correct spot in the LaTeX file and won't be included in the LaTeX source file.)
I am using nokogiri to manipulate the xml file and have been able to reach the point where I return a nodelist that contains just the <field> tags for each <record>, but I need to be able to retrieve all the text in the <field>, including the formatting information (if I use the text method on a <field>, it returns "SS Titanic" for example, with all formatting information stripped away).
I'm stuck on how to access the entire text string in the <field> tag. Once I can get that, I have a good idea of how to structure my parser.
Any help will be greatly appreciated.
does this help?
require 'nokogiri'

xml = <<~XML
  <record time="2022-08-27T17:25:12" id="30">
    <field><text style="i"/><hide>SS </hide>Titanic<text/></field>
    <field>passengers</field>
    <field class="locator"><text style="b"/>5<text/></field>
  </record>
XML

fields = Nokogiri::XML(xml).xpath(".//field")
puts fields.first.text   #=> "SS Titanic"
p fields.map(&:text)     #=> ["SS Titanic", "passengers", "5"]
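If you need the formatting markup preserved rather than the stripped text, serializing each node's children may be closer to what you're after. A sketch building on the snippet above (my own suggestion):

# Serialize the child nodes of the first <field>, keeping the <text> and
# <hide> tags intact so you can map them to \textit{} and sort keys yourself.
puts fields.first.children.to_xml
# expect something like: <text style="i"/><hide>SS </hide>Titanic<text/>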
I've spent the whole day trying to figure out how to parse USPTO bulk XML files. I downloaded one of those files, unzipped it, and then ran:
Nokogiri::XML(File.open('ipg140513.xml'))
But it seems to load only the first element, not all of the patents (there are a few thousand in that file).
What am I doing wrong?
The file you linked to, and presumably the others, are not valid XML because they do not have a single root element. From Wikipedia:
Each XML document has exactly one single root element.
Nokogiri hints at this if you look at the errors (suggested by Arup Rakshit), as detailed in the documentation:
Nokogiri::XML(File.open("/Users/b/Downloads/ipg140513.xml")).errors # =>
# [
# #<Nokogiri::XML::SyntaxError: XML declaration allowed only at the start of the document>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>
# ]
The file appears to be a concatenation of a series of valid XML files, each having a <us-patent-grant/> as its root element.
Fortunately, Nokogiri can handle this invalid XML if you process it as a document fragment. Try this:
Nokogiri::XML::DocumentFragment.parse(File.read('ipg140513.xml')).select{|element| element.name == 'us-patent-grant'}
The select chooses the root node of each concatenated document, ignoring the processing instructions and DTD declarations.
Alternatively, you could pre-process the file and split it into its constituent, correctly formatted documents. Parsing a 650 MB document all at once is quite slow and memory-intensive.
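If you go that route, here is one possible sketch (my own, untested against the real 650 MB dump, and assuming each concatenated document begins with its own <?xml declaration on a new line): stream the file and parse one document at a time.

require 'nokogiri'

# Buffer one concatenated document at a time; a fresh "<?xml" line marks the
# start of the next document, so parse what we have and reset the buffer.
buffer = ''
File.foreach('ipg140513.xml') do |line|
  if line.start_with?('<?xml') && !buffer.empty?
    doc = Nokogiri::XML(buffer)
    grant = doc.at('us-patent-grant')
    # ... process `grant` here ...
    buffer = ''
  end
  buffer << line
end
Nokogiri::XML(buffer) unless buffer.empty?   # don't forget the last document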
I created some XML in Notepad (figuring any extra characters or formatting would be stripped, leaving a plain string) and then pasted it into Fiddler. It looks like this:
<MyValue1>AB</MyValue1>
<MyValue2>BlahBlah</MyValue2>
When I inspect the raw message in my Web API service I see notation like the following:
<MyValue1>AB</MyValue1>\r\n\t\t<MyValue2>BlahBlah</MyValue2>\r\n\t\t
Notice the \r\n\t\t?
If I go back to Fiddler and make the XML into a single line string as opposed to a formatted and indented XML document, then I do not see those characters.
How do I make it so those line breaks are not a part of the XML being POSTed to my service without having to make a single line of XML in Fiddler?
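As an aside (my own illustration, not from the question): the \r\n\t\t is simply the indentation whitespace between elements in the pretty-printed XML, and an XML parser treats it as ignorable whitespace-only text nodes, so the element values themselves are unaffected. A quick check, using Ruby's Nokogiri to match the other examples here:

require 'nokogiri'

# The fragment is wrapped in a hypothetical <Root> element so it parses as a
# complete document; the escaped whitespace mirrors what Fiddler showed.
raw = "<Root>\r\n\t\t<MyValue1>AB</MyValue1>\r\n\t\t<MyValue2>BlahBlah</MyValue2>\r\n\t\t</Root>"
doc = Nokogiri::XML(raw)
puts doc.at('MyValue1').text   #=> "AB" -- no stray \r\n\t\t in the value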
I need to read in a file that will be in XML format but crammed into a single line, and parse that line to find a specific property and replace its value with something I have specified.
The file might contain:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><VerificationPoint type="Screenshot" version="2"><Description/><Verification object=":qP1B11_QLabel" type="PNG">
I need to search through this line, find the property "Verification object=", and replace the :qP1B11 with my own string. Please note that I don't want to replace the _QLabel" type="PNG"> part of the string if possible.
I can't use sub because I don't know the value of the property, which could be anything. I believe I should be able to do this with regular expressions, but I have never had to use them before, and all the examples I've seen just make me more confused.
If anyone can present me with an elegant answer (and an explanation if using regexp) it would be a huge help!
Thanks
You have XML so use an XML parser. Nokogiri will make short work of that:
doc = Nokogiri::XML(that_string)
doc.search('Verification').each do |node|
node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
end
new_string = doc.to_xml
# => "<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<VerificationPoint type="Screenshot" version="2">\n  <Description/>\n  <Verification object="PANCAKES_QLabel" type="PNG">\n</Verification>\n</VerificationPoint>\n"
You can adjust the output format using the options for to_xml.
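For instance (one possibility, not something the original answer spells out), passing save_with without the FORMAT flag should keep Nokogiri from re-indenting, so the output stays close to the original single-line form:

# Serialize as plain XML with no pretty-printing or added indentation.
plain = Nokogiri::XML::Node::SaveOptions::AS_XML
new_string = doc.to_xml(save_with: plain)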
If you only have one <Verification> then you could do it like this:
node = doc.at('Verification')
node['object'] = node['object'].sub(/:qP1B11/, 'PANCAKES')
new_string = doc.to_xml
In either case you'd adjust your regex and replacement to suit your needs.
Can sitemap.xml processors cope with this?
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY port ":8080">
<!ENTITY host"http://example.com&port;">
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>&host;/path/</loc>
<!-- ...
I assume so, though it will most likely just ignore the entity declarations. If there is no Sitemaps DTD, I think it has to ignore them unless it expects them.
From Wikipedia:
In the markup languages SGML, HTML, XHTML and XML, a character entity reference is a reference to a particular kind of named entity that has been predefined or explicitly declared in a Document Type Definition (DTD). The "replacement text" of the entity consists of a single character from the Universal Character Set/Unicode. The purpose of a character entity reference is to provide a way to refer to a character that is not universally encodable.
In short, no. Not unless the preprocessor is very forgiving.
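For what it's worth, for those entity declarations even to be well-formed XML they would need to live inside an internal DTD subset; whether a given sitemap consumer then expands them is a separate question. A sketch of the well-formed variant, checked with Nokogiri as in the other examples (my own illustration):

require 'nokogiri'

# Entity declarations moved into a DOCTYPE internal subset so the document is
# well-formed; &port; is referenced inside &host;, just as in the question.
sitemap = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE urlset [
    <!ENTITY port ":8080">
    <!ENTITY host "http://example.com&port;">
  ]>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>&host;/path/</loc>
    </url>
  </urlset>
XML

doc = Nokogiri::XML(sitemap) { |cfg| cfg.noent }  # ask libxml2 to substitute entities
puts doc.errors.empty?                            # true if well-formed
puts doc.at_xpath('//xmlns:loc').text             # expect "http://example.com:8080/path/"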