Convert xpath node back to html-markup in Go - xpath

import (
	"fmt"
	"gopkg.in/xmlpath.v2"
	"log"
)
...
path := xmlpath.MustCompile("//div[@id='23']")
tree, err := xmlpath.ParseHTML(reader)
if err != nil {
	log.Fatal("HTML parsing error, maybe not well-formed: ", err)
}
iter := path.Iter(tree)
for iter.Next() {
	fmt.Println(iter.Node().String()) // returns only the values of the text nodes
}
...
Is there a way to convert iter.Node() back to HTML markup like <div>...</div>? iter.Node().String() returns only the values of all inner text nodes. As far as I can see, the documentation of the xmlpath package does not offer such a function.

You are right: the functions in gopkg.in/xmlpath.v2 are limited to reading the content of nodes, and there are not many alternatives in Go for working with a DOM.
Among native Go libraries, I can only mention goquery. It works only with HTML and does not support XPath, but it does support CSS selectors. Maybe that would be enough in your case.
If you really need to work with both HTML and XML via XPath, there is a libxml wrapper for Go called gokogiri. It supports all the features of libxml, so you can get nodes, inner/outer HTML, attributes and other things. I used it to extract text content in a service that is currently in production. It's a bit faster than PHP's DOMDocument. The one limitation is that I'm not sure whether it supports Go versions higher than 1.4.*. Oh, and installation on Windows is a bit tricky.

I know this answer is too late, but I still recommend these packages written in native Go: xquery and xpath. They support extracting data or evaluating values from XML/HTML using XPath expressions.

Related

Best way to marshal map to struct fields in GO

I want to know which is the best way to create instances of a certain struct based on a map[string]string
My app should process huge files in CSV format and should create an instance of a struct for each row of the file.
I'm already using the encoding/csv/Reader from golang to read the CSV file and create an instance of map[string]string for each row in the file.
So given this file:
columnA, columnB, columnC
a, b, c
My own reader implementation will return this map (each row values with the header):
myMap := map[string]string{
	"columnA": "a",
	"columnB": "b",
	"columnC": "c",
}
(This is just an example; in real life the file contains a lot of columns and rows.)
So at this point I need to create an instance of the struct that corresponds to the row contents, let's say:
type MyStruct struct {
	AColumn string
	BColumn string
	CColumn string
}
My question is: what is the best way to create an instance of the struct from the given map? I have already implemented a version that just copies each value from the map to the struct, but the code ended up being very long and tedious:
s := &MyStruct{}
s.AColumn = m["columnA"]
s.BColumn = m["columnB"]
s.CColumn = m["columnC"]
...
I also considered using this library, https://github.com/mitchellh/mapstructure, but I don't know whether reflection is the best approach, considering that the file is huge and reflection would be used for each row.
Maybe there is no other option but I'm asking just in case someone knows a better approach.
Thanks in advance.
I would say that the idiomatic Go way is to just populate the struct's fields from your map. Go favors explicitness, and this approach is the most direct and the easiest to read. In other words, your approach is correct.
You could make it slightly nicer by initializing the struct directly:
s := &MyStruct{
	AColumn: m["columnA"],
	BColumn: m["columnB"],
	CColumn: m["columnC"],
}
Now, if your structure has 100s of fields (which is an odd design choice), you may want to leverage some code generation. Otherwise, just go with the straightforward code - it's the best approach in the long term.
I have already published a library that I wrote for things I have needed from time to time. I wrote MapToStruct a few months ago and pushed it today to share the full library with you. The library is based on reflect. I am still testing and implementing things, so you will find some odd comments and the like.
https://github.com/FedeMFernandez/goscript
I hope it is useful.
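For comparison, this is roughly what a reflection-based approach looks like using only the standard library. It is a sketch under assumptions, not the API of the library linked above or of mapstructure: the `col` struct tag and the fillFromMap helper are made up for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// MyStruct is the struct from the question, annotated with a
// hypothetical `col` tag that names the CSV column for each field.
type MyStruct struct {
	AColumn string `col:"columnA"`
	BColumn string `col:"columnB"`
	CColumn string `col:"columnC"`
}

// fillFromMap (hypothetical helper) copies m[tag] into every settable
// string field of the struct that dst points to. Fields without a `col`
// tag, and keys missing from the map, are left untouched.
func fillFromMap(dst interface{}, m map[string]string) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Ptr || v.IsNil() || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("dst must be a non-nil pointer to a struct, got %T", dst)
	}
	v = v.Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		key, ok := t.Field(i).Tag.Lookup("col")
		if !ok || v.Field(i).Kind() != reflect.String || !v.Field(i).CanSet() {
			continue
		}
		if val, present := m[key]; present {
			v.Field(i).SetString(val)
		}
	}
	return nil
}

func main() {
	var s MyStruct
	m := map[string]string{"columnA": "a", "columnB": "b", "columnC": "c"}
	if err := fillFromMap(&s, m); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
}
```

The trade-off is the one raised in the question: the tag lookups happen on every call, so for a very large file it is worth benchmarking this against the explicit assignment before committing to it.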

Retrieve raw data from html.Node

I want to get contents of html.Node as a string.
Example:
<div id="my-node">
<p>First paragraph</p>
<p>Second paragraph</p>
</div>
Given myNode := html.Node("#my-node") (pseudocode), I want to retrieve entire above html as a string. Indentation does not matter.
I couldn't find anything on the internet except iterating over the contents of the node with myNode.NextSibling, but that's overcomplicated and I'm pretty sure there has to be an easier way.
Update:
I'm referring to the golang.org/x/net/html package.
I get what you mean, I use a lot of this in tests.
What you need is already in the same x/net/html package - you can Render the Node to a bytes.Buffer then get a string out of it:
// assumes the enclosing function returns (string, error)
var b bytes.Buffer
if err := html.Render(&b, node); err != nil {
	return "", err
}
return b.String(), nil
Please read the documentation for html.Render: rendering is done on a best-effort basis, but it will probably suit your needs.
PS. You can consult how it's used in a more real project of mine:
https://github.com/wkhere/htmlx/blob/f22d01b/finder.go#L32-L39
https://github.com/wkhere/htmlx/blob/f22d01b/finder_test.go#L71-L73

Is there a way to ensure that all data in a yaml string was parsed?

For testing, I often see go code read byte slices, which are parsed into structs using yaml, for example here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/strategicpatch/patch_test.go#L74m
I just got bitten by not exporting my field names, resulting in an empty list which I iterated over in my test cases, thus assuming that all tests were passing (in hindsight, that should have been a red flag :)). There are other errors which are silently ignored by yaml unmarshaling, such as a key being misspelled and not matching a struct field exactly.
Is there a way to ensure that all the data in the byte slice was actually parsed into the struct returned by yaml.Unmarshal? If not, how do others handle this situation?
go-yaml/yaml
For anyone searching for a solution to this problem, the yaml.v2 library has an UnmarshalStrict function that returns an error if there are keys in the YAML document that have no corresponding fields in the Go struct.
import yaml "gopkg.in/yaml.v2"
err := yaml.UnmarshalStrict(data, destinationStruct)
BurntSushi/toml
It's not part of the question, but I'd just like to document how to achieve something similar in toml:
You can find out whether there were any keys in the TOML file that could not be decoded by using the metadata returned by the toml.Decode function.
import "github.com/BurntSushi/toml"
metadata, err := toml.Decode(data, destinationStruct)
undecodedKeys := metadata.Undecoded()
Note that metadata.Undecoded() also returns keys that have not been decoded because of a Primitive value. You can read more about it here.
JSON
The standard Go JSON library (encoding/json) does not currently support this, but there is a proposal ready to be merged. It looks like it will be part of Go 1.10.
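That proposal shipped in Go 1.10 as the Decoder.DisallowUnknownFields method in encoding/json. A minimal sketch of strict decoding with it follows; the testCase type and the decodeStrict helper are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// testCase is a hypothetical target struct.
type testCase struct {
	Name  string   `json:"name"`
	Items []string `json:"items"`
}

// decodeStrict (hypothetical helper) decodes data into v and returns an
// error if the JSON object contains a key with no matching struct field.
func decodeStrict(data []byte, v interface{}) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields() // available since Go 1.10
	return dec.Decode(v)
}

func main() {
	var tc testCase
	// "itmes" is misspelled on purpose, so decoding reports an error
	// instead of silently leaving Items empty.
	err := decodeStrict([]byte(`{"name":"x","itmes":["a"]}`), &tc)
	fmt.Println("error:", err)
}
```

Note that json.Unmarshal itself has no strict mode; the option only exists on json.Decoder.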

How to grab a piece of data which has a different xpath on different webpages?

So I am trying to grab a piece of data that appears at a different XPath on different pages.
If you look at the XPath of the IPA pronunciation on Wiktionary, e.g. https://en.wiktionary.org/wiki/foo, you will see that it is
//*[@id="mw-content-text"]/ul[1]/li[1]/span[4]
but if I go to another word, like https://en.wiktionary.org/wiki/bar, then the XPath is
//*[@id="mw-content-text"]/ul[1]/li[2]/span[5]
I cannot think of any way to reconcile these. Is there something that I am missing?
The answer is simple. Never let a tool write any XPath for you. All tools get it wrong.
Look at the document's HTML source and write the appropriate XPath yourself.
var result = document.evaluate("//*[@class = 'IPA']", document),
    elem;
while (elem = result.iterateNext()) {
    console.log(elem);
}
The above shows the simplest variant. It selects two occurrences of <span class="IPA"> on https://en.wiktionary.org/wiki/foo and quite a few more on https://en.wiktionary.org/wiki/bar.
Use a more specific expression to narrow down the results.

Select default namespace in XPath with HtmlUnit

I want to parse a Feedburner feed with HtmlUnit.
The feed is this one: http://feeds.feedburner.com/alcoanewsreleases
From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.
groovy code snippet:
def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")
Sample of the XML feed:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
[...SNIP...]
<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&pageID=20110518006002en">
<title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
<dc:date>2011-05-18</dc:date>
<link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
<description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
<feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&pageID=20100104006194en</feedburner:origLink>
</item>
[...SNIP...]
</rdf:RDF>
I suspect this to be an issue with the namespaces because this document has 4 namespaces. The namespaces are
(this is the default) xmlns="http://purl.org/rss/1.0/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"
I have tried to use Nokogiri with this (another XML Parser that I use for ruby scripts).
With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed.
I have tried the same XPath with HtmlUnit but it does not work.
So I think I can phrase my question as:
How can I select a node from the default namespace with HtmlUnit?
Any ideas?
"From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case."
In XPath, that means "select all elements whose local name is item that are in no namespace". In RSS, the item elements must be in a namespace. So the above should never work with a conforming XML parser and XPath engine.
What's confusing is that in XML, <item> means "an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document;" whereas in XPath, "item" means an element in no namespace. (Or, you could say, it means an element in the default namespace, but unless you have a way to tell XPath what the default namespace is, the default namespace is no namespace. Usually (always?) in XPath 1.0 there is no way to declare the default namespace for XPath expressions.)
The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below...
"With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed."
Whatever that is, it's not XPath. Maybe it's a Nokogiri extension to it (a very convenient one, but its syntax is really counter-intuitive).
"So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?"
Let's phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the "rss" namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.
It's kind of like asking, "How do I write a function (in Javascript, C, Java, etc.) that can tell me the value of the variable a?" Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you'll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you'd think that compiler was nuts. It's not supposed to care about variable names as long as you use valid identifiers.
In the same way, when processing RSS, we're not supposed to care about what namespace prefix is used, as long as it's a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).
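The point that only the namespace URI matters, not the prefix, can be sketched outside XPath as well. The following uses Go's encoding/xml (not HtmlUnit), whose decoder resolves every element name to a namespace URI plus a local name; the countItems helper and both sample documents are made up for illustration.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// rssNS is the namespace URI that RSS 1.0 elements live in.
const rssNS = "http://purl.org/rss/1.0/"

// countItems (hypothetical helper) counts elements whose namespace URI
// is rssNS and whose local name is "item", whatever prefix the document
// happens to use for that namespace.
func countItems(doc string) (int, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return 0, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Space == rssNS && se.Name.Local == "item" {
			n++
		}
	}
}

func main() {
	// Same vocabulary, once as the default namespace and once with a prefix.
	defaultNS := `<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/"><item rdf:about="a"><title>t</title></item></rdf:RDF>`
	prefixed := `<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rss="http://purl.org/rss/1.0/"><rss:item rdf:about="a"><rss:title>t</rss:title></rss:item></rdf:RDF>`

	for _, doc := range []string{defaultNS, prefixed} {
		n, err := countItems(doc)
		if err != nil {
			panic(err)
		}
		fmt.Println("items found:", n)
	}
}
```

Both documents yield the same count, because the comparison is made on the resolved URI and the prefix never enters into it.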
In XPath 2.0, you can wildcard the namespace. This is very handy if you know you're not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don't think HTMLUnit supports XPath 2.0. Also in XPath 2.0 environments like XSLT 2.0, you can specify a default namespace for XPath expressions, but that won't help you in HTMLUnit.
So you have a couple of choices:
Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].
or
The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HTMLUnit and pass it to the XPath processor? I took a quick look in the docs and didn't find any facility for doing that.
Caveat: I should add that the above is in regard to conforming XPath processors. I have no idea what XPath processor HTMLUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.
I saw here that someone used the following syntax for elements in the default namespace in HTMLUnit:
//:item
But I wouldn't recommend that, for three reasons:
It's not valid XPath, so you can't expect it to work with other programs.
It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.
It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don't adequately support namespaces.
HTMLUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing ways to declare namespace prefixes is a bug. HTMLUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don't know if HTMLUnit exposes that functionality.
This sounds familiar enough that I'm quite sure I've used namespaces and XPath successfully with HtmlUnit in the past, but of course I can't find the code. I suspect it must have been with HTML pages only: the page reference in your example is an XmlPage which has a number of methods specific to namespaces, all of which throw a "not implemented yet" exception when used. :-(
The current version (2.8) of HtmlUnit is nearly a year old, so it may be that some work has been done in the meantime to support XML namespaces. The "HtmlUnit Users" mailing list would be the place to find out.
In the meantime, as always there is a workaround:
final XmlPage page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases");
// no good
List<?> elements = page.getByXPath("//item");
System.out.println(elements.size());
// ugly, but it works
DomElement de = (DomElement) page.getFirstByXPath("//rdf:RDF");
List<DomNode> items = new ArrayList<DomNode>();
for (DomNode dn : de.getChildNodes()) {
    String name = dn.getLocalName();
    if (name != null && name.equals("item")) {
        items.add(dn);
    }
}
System.out.println("found " + items.size());
Oh boy Java is painful after working in Scala... ;-)
