Unmarshal with global namespace - go

I have the following XML:
<rss version="2.0">
<channel>
...
<item>
<link>http://stackoverflow.com</link>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" href="http://stackoverflow.com"/>
...
</item>
</channel>
</rss>
I want to extract the link. I have the following struct:
type Item struct {
Link string `xml:"http://www.w3.org/2005/Atom link"`
}
I know that I need a prefix to get the Link, but there is no namespace given for it (in the form of an xmlns attribute), and I don't know how to refer to it.
I could, of course, save all *:link elements to a slice, but I'm sure there is a better solution.
Thanks in advance!

The namespace handling in the standard library encoding/xml package seems to be a bit ad hoc, and having elements in different namespaces with the same name seems to be a trigger.
Ideally you'd be able to decode the given XML into the following structures:
type Rss struct {
Items []Item `xml:"channel>item"`
}
type Item struct {
Link string `xml:"link"`
AtomLink AtomLink `xml:"http://www.w3.org/2005/Atom link"`
}
type AtomLink struct {
Href string `xml:"href,attr"`
}
But this results in the error main.Item field "Link" with tag "link" conflicts with field "AtomLink" with tag "http://www.w3.org/2005/Atom link" (as seen in http://play.golang.org/p/LgW-vm4euL).
However, if we decide that we want to ignore the <atom:link> element by commenting out the Item.AtomLink field, we end up decoding an empty string, since xml:"link" matches <link> elements in any namespace rather than just the blank namespace. The final <atom:link> element is empty, so doesn't return anything.
A couple of possible workarounds include:
Only try to decode the <atom:link> element, since it can be selected uniquely. This may not be useful if you're also processing RSS feeds without Atom namespace elements.
Collect the contents of all <link> elements by modifying the Item struct to use:
Links []string `xml:"link"`
And then discard any empty strings in the slice.
At the end of the day, the package will need some way to refer to the blank namespace. That may require new syntax in order to keep existing programs functioning though.
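The second workaround above can be sketched roughly as follows. This is only an illustration: the helper name nonEmptyLinks, the lowercase struct names, and the sampleFeed constant are made up for the example, and it assumes that an xml:"link" tag matches <link> elements in any namespace, as described above.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// rss mirrors only the parts of the feed needed for this sketch.
type rss struct {
	Items []item `xml:"channel>item"`
}

type item struct {
	// A tag without a namespace matches <link> in any namespace, so the
	// empty <atom:link/> element contributes an empty string to the slice.
	Links []string `xml:"link"`
}

// sampleFeed is a trimmed copy of the XML from the question.
const sampleFeed = `<rss version="2.0">
<channel>
<item>
<link>http://stackoverflow.com</link>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" href="http://stackoverflow.com"/>
</item>
</channel>
</rss>`

// nonEmptyLinks (hypothetical helper) decodes the feed and returns every
// non-empty link text from all items, discarding the empty strings.
func nonEmptyLinks(data string) ([]string, error) {
	var feed rss
	if err := xml.Unmarshal([]byte(data), &feed); err != nil {
		return nil, err
	}
	var out []string
	for _, it := range feed.Items {
		for _, l := range it.Links {
			if l != "" {
				out = append(out, l)
			}
		}
	}
	return out, nil
}

func main() {
	links, err := nonEmptyLinks(sampleFeed)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(links)
}
```

Note that this flattens the links of all items into one slice; keeping them per item would only need a different return type.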

Related

Lookup field by tag

Consider the following struct
type Test struct {
A string `t1:"x"`
B string `t1:"y"`
}
Using the reflect package, is there any way for me to get "A" if I know that t1 tag has value "x"?
Not a direct one.
You must iterate over all fields and check if the field has the appropriate tag.
(Note that two fields may have the same tag, so looking up by tag would not really work.)
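A minimal sketch of that iteration, assuming you only want the first matching field. The function name fieldByTag is invented for this example and is not part of the reflect package.

```go
package main

import (
	"fmt"
	"reflect"
)

type Test struct {
	A string `t1:"x"`
	B string `t1:"y"`
}

// fieldByTag (hypothetical helper) returns the name of the first struct field
// whose tag `key` has the given value. It accepts a struct or a pointer to one.
func fieldByTag(v interface{}, key, value string) (string, bool) {
	t := reflect.TypeOf(v)
	if t != nil && t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t == nil || t.Kind() != reflect.Struct {
		return "", false
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// Lookup distinguishes a missing tag from an empty one.
		if tag, ok := f.Tag.Lookup(key); ok && tag == value {
			return f.Name, true
		}
	}
	return "", false
}

func main() {
	name, ok := fieldByTag(Test{}, "t1", "x")
	fmt.Println(name, ok)
}
```

Because several fields may carry the same tag value, a variant returning all matching names may be more appropriate in some cases.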

XQuery/Xpath referring to xml elements with no namespace, in a namespace environment

In Xquery 3.1 (under eXist-DB 4.7) I receive xml data like this, with no namespace:
<edit-request id="TC9999">
<title-collection>foocolltitle</title-collection>
<title-exempla>fooextitle</title-exempla>
<title-short>fooshorttitle</title-short>
</edit-request>
This is assigned to a variable $content and this statement:
let $collid := $content/edit-request/@id
...correctly returns: TC9999
Now, I need to actually transform all the data in $content into a TEI xml document.
I first need to get some info from an existing TEI file, so I assigned another variable:
let $oldcontent := doc(concat($globalvar:URIdata,$collid,"/",$collid,".xml"))
And then I create the new TEI document, referring to both $content and $oldcontent:
let $xml := <listBibl xmlns="http://www.tei-c.org/ns/1.0"
type="collection"
xml:id="{$collid}">
<bibl>
<idno type="old_sql_id">{$oldcontent//tei:idno[@type="old_sql_id"]/text()}</idno>
<title type="collection">{$content//title-exempla/text()}</title>
</bibl>
</listBibl>
The references to the TEI namespace in $oldcontent come through, but to my surprise the references to $content (no namespace) don't show up:
<listBibl xmlns="http://www.tei-c.org/ns/1.0"
type="collection"
xml:id="TC9999">
<bibl>
<idno type="old_sql_id">1</idno>
<title type="collection"/>
</bibl>
</listBibl>
The question is: how do I refer to the non-namespace elements in $content in the context of let $xml=...?
N.B.: the XQuery document has a declaration at the top (as it is the principal namespace of virtually all the documents):
declare namespace tei = "http://www.tei-c.org/ns/1.0";
In essence you are asking how to write an XPath expression to select nodes in an empty namespace in a context where the default element namespace is non-empty. One of the most direct solutions is to use the "URI plus local-name syntax" for writing QNames. Here is an example:
xquery version "3.1";
let $x := <x><y>Jbrehr</y></x>
return
<p xmlns="foo">Hey there,
{ $x/Q{}y => string() }!</p>
If instead of $x/Q{}y the example had used the more common form of the path expression, $x/y, its result would have been an empty sequence, since the local name y used to select the <y> element specifies no namespace and thus inherits the foo element namespace from its context. By using the "URI plus local-name syntax", though, we are able to specify the empty namespace we are looking for.
For more information on this, see the XPath 3.1 specification's discussion of expanded QNames: https://www.w3.org/TR/xpath-31/#doc-xpath31-EQName.

Go XML suppression of automatically generated tags?

I'm trying to implement an XML format under Go that was originally written in Fortran. The format is already specified so I'm not free to make changes to the standard. Unfortunately, the format includes data that is not enclosed by an XML tag, thus I would like to suppress the automatic tag creation provided by xml.Marshal.
I've investigated all the standard options associated with marshalling, as documented at: https://golang.org/pkg/encoding/xml/
By default marshalling will use the structure variable name, which can be overridden by the xml: definition. As far as I can tell there is no definition that suppresses the tag name.
type SAO_FREQUENCY_LIST struct {
Type string `xml:",attr"`
SigFig int `xml:",attr"`
Units string `xml:",attr"`
Description string `xml:",attr"`
Frequencies string `xml:""`
}
I want the XML output to be as follows:
<FrequencyList Type="float" SigFig="5" Units="MHz" Description="Nominal Frequency">
3.7 3.8
</FrequencyList>
By default xml.MarshalIndent(..) yields:
<FrequencyList Type="float" SigFig="5" Units="MHz" Description="Nominal Frequency">
<Frequencies>3.7 3.8</Frequencies>
</FrequencyList>
You can use the ,chardata modifier to indicate that the value of a struct member should be used as character data for the XML element. For your example, this would be:
type FrequencyList struct {
...
Frequencies string `xml:",chardata"`
}
You can experiment with an example using this change here: https://play.golang.org/p/oBa8HuE-57d
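For reference, a self-contained version of the idea might look like the sketch below. The XMLName field and the marshalFrequencies helper are assumptions added for this example (the original struct in the question is named differently), and the attribute values are taken from the desired output above.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// FrequencyList writes Frequencies as character data instead of a child element.
type FrequencyList struct {
	XMLName     xml.Name `xml:"FrequencyList"` // assumed element name, taken from the desired output
	Type        string   `xml:",attr"`
	SigFig      int      `xml:",attr"`
	Units       string   `xml:",attr"`
	Description string   `xml:",attr"`
	Frequencies string   `xml:",chardata"`
}

// marshalFrequencies (hypothetical helper) marshals a sample value to XML.
func marshalFrequencies() (string, error) {
	fl := FrequencyList{
		Type:        "float",
		SigFig:      5,
		Units:       "MHz",
		Description: "Nominal Frequency",
		Frequencies: "3.7 3.8",
	}
	out, err := xml.Marshal(fl)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := marshalFrequencies()
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(s)
}
```

xml.Marshal emits the element on a single line; if the line breaks around the character data matter for the target format, they would have to be part of the Frequencies string itself.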

Parsing xml with Go, ignoring nested elements?

I am trying to parse an HTML document with the Go xml parser. I have managed to extract all the <li> elements, but if an element contains a link <a>, then the content of the link is ignored. I would like to just ignore the nested <a> and display its content as plain text, but I don't know how.
Here is my code:
d := xml.NewDecoder(resp.Body)
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity
type list_item struct {
Data string `xml:",chardata"`
}
for {
t,_ := d.Token()
if t == nil {
break
}
switch se := t.(type) {
case xml.StartElement:
if se.Name.Local == "li" {
var q list_item
d.DecodeElement(&q, &se)
c.Infof("%+v\n", q)
}
}
}
Is there any way to just ignore nested elements and display their content?
Consider using a specialized package for parsing HTML. In general, HTML is not XML (XHTML 1.0 is, but documents formatted using it are not very common, and that standard has been deprecated).
An even better approach in my opinion, given your apparent use case, would be using XPath to extract the necessary information with a query.
As to the question as stated, I think there's no built-in way to do what you want: xml.Decoder implements the Skip() method, but it only allows you to skip over unneeded content; there's nothing that returns the "inner XML" as is. You could roll this yourself using xml.Decoder's RawToken(): immediately render whatever it returns until it returns something denoting the end element you're looking for (you'll have to implement support for handling nested elements).
I found a library that uses the jQuery style of getting html information: http://godoc.org/github.com/PuerkitoBio/goquery
I used that and it solved the problem.
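For completeness, the roll-your-own approach mentioned above (reading tokens until the matching end element and keeping only the text) might look roughly like this. It is a sketch under a few assumptions: it uses Token() rather than RawToken() for simplicity, it takes the document as a string, and the names elementText, listItems, and sampleHTML are invented for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleHTML is a small, well-formed stand-in for the real response body.
const sampleHTML = `<ul><li>plain</li><li>see <a href="x">the link</a> here</li></ul>`

// elementText reads tokens until the element that was just opened is closed,
// concatenating all character data, including text inside nested elements.
func elementText(d *xml.Decoder) (string, error) {
	var sb strings.Builder
	depth := 1 // the caller has already consumed the start tag
	for depth > 0 {
		tok, err := d.Token()
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			depth++
		case xml.EndElement:
			depth--
		case xml.CharData:
			sb.Write(t)
		}
	}
	return sb.String(), nil
}

// listItems returns the text content of every <li> element in doc.
func listItems(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.Strict = false
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var items []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "li" {
			text, err := elementText(d)
			if err != nil {
				return nil, err
			}
			items = append(items, text)
		}
	}
	return items, nil
}

func main() {
	items, err := listItems(sampleHTML)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, it := range items {
		fmt.Printf("%q\n", it)
	}
}
```

Real-world HTML is often not well-formed enough for encoding/xml even in non-strict mode, so a dedicated HTML parser remains the safer choice.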

How to get node value using variable node name?

I have an XML document like:
<data>
<item type="apple">
<misc>something</misc>
<appleValue>23</appleValue>
<misc2>something else</misc2>
</item>
<item type="banana">
<bananaValue>47</bananaValue>
<random>something</random>
</item>
</data>
I can get the items with doc("data.xml")/data/item, but I need to get the text from the elements whose names end with Value. So I'd like to get "23" and "47", but I don't necessarily know the element names. All I really know is that there are elements that end in Value; I don't know if it's appleValue, bananaValue, etc., except that I could look at the type attribute and build up a string.
let $type := (doc("data.xml")/data/item)[1]/@type
doc("data.xml")/data/item/$typeValue
...That last line is what I'm trying to get at, clearly that's not correct but I need to find elements whose name is known based on a variable (stored in a variable such as $type) and "Value".
Any ideas? I realize this variable element naming is strange/odd/bad...but that's the way it is and I have to deal with it.
I got it thanks to this post: Can XPath match on parts of an element's name?
doc("data.xml")/data/item/*[ends-with(name(), "Value")]
I would avoid using the name() function in favor of either node-name() or local-name(). The reason for this is that name() can give you different answers depending on what (and whether) namespace prefixes are used in the source. For example, the following three elements have the same exact name (QName):
<appleValue xmlns="http://example.com"/>
<x:appleValue xmlns:x="http://example.com"/>
<y:appleValue xmlns:y="http://example.com"/>
However, the name() function will give you a different answer for each one (appleValue, x:appleValue, and y:appleValue, respectively). So you're better off either ignoring the namespace by using local-name() (which returns the string appleValue for all three of the above cases) or explicitly specifying the namespace (even if it's empty, as Oliver showed), using node-name() (which returns a proper QName value, rather than a string). In this case, since you're not using namespaces (and since even if you added one later, the code will still work), I'd be slightly in favor of using local-name() as follows:
doc("data.xml")/data/item/*['Value' eq substring-after(local-name(),../@type)]
For elaboration on reasons to avoid the name() function (and exceptions), see "Perils of the name function".
You can access the name of the node using name(). XPath 1.0 does not have an ends-with() function, but you can emulate it with substring() and string-length(): starting at string-length(name()) - 4 selects the last five characters, which is the length of 'Value'.
//item/*[ substring( name(), string-length(name() ) - 4 ) = 'Value']
A more precise way to implement this would be
for $item in doc("data.xml")/data/item
let $value-name := fn:QName('', concat($item/@type, 'Value'))
return $item/*[node-name() = $value-name]

Resources