Retrieve raw data from html.Node - go

I want to get the contents of an html.Node as a string.
Example:
<div id="my-node">
    <p>First paragraph</p>
    <p>Second paragraph</p>
</div>
Given myNode := html.Node("#my-node") (pseudocode), I want to retrieve the entire HTML above as a string. Indentation does not matter.
I couldn't find anything on the internet except iterating over the contents of the node via myNode.NextSibling, but that's overcomplicated and I'm pretty sure there has to be an easier way.
Update:
I'm referring to the golang.org/x/net/html package.

I get what you mean; I use a lot of this in tests.
What you need is already in the same x/net/html package - you can Render the node to a bytes.Buffer and then get a string out of it:
var b bytes.Buffer
if err := html.Render(&b, node); err != nil {
    return "", err
}
return b.String(), nil
Please read the doc on how rendering is done on a best-effort basis - but it will probably fit your needs.
PS. You can consult how it's used in a more realistic project of mine:
https://github.com/wkhere/htmlx/blob/f22d01b/finder.go#L32-L39
https://github.com/wkhere/htmlx/blob/f22d01b/finder_test.go#L71-L73
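For completeness, here is a minimal, self-contained sketch of the whole round trip - parse, locate the node, render it back. The findByID helper is not part of the package; it's just an assumed depth-first walk written for this example:

package main

import (
    "bytes"
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// renderNode serializes an *html.Node back to its HTML source.
func renderNode(n *html.Node) (string, error) {
    var b bytes.Buffer
    if err := html.Render(&b, n); err != nil {
        return "", err
    }
    return b.String(), nil
}

// findByID walks the tree depth-first and returns the first element
// whose id attribute matches the given value.
func findByID(n *html.Node, id string) *html.Node {
    if n.Type == html.ElementNode {
        for _, a := range n.Attr {
            if a.Key == "id" && a.Val == id {
                return n
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        if found := findByID(c, id); found != nil {
            return found
        }
    }
    return nil
}

func main() {
    const src = `<div id="my-node"><p>First paragraph</p><p>Second paragraph</p></div>`
    doc, err := html.Parse(strings.NewReader(src))
    if err != nil {
        panic(err)
    }
    if node := findByID(doc, "my-node"); node != nil {
        out, _ := renderNode(node)
        fmt.Println(out) // prints the whole <div id="my-node">...</div>
    }
}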

Related

Customising Pandoc writer element output

Is it possible to customise element outputs for a pandoc writer?
Given reStructuredText input
.. topic:: Topic Title

   Content in the topic
Using the HTML writer, Pandoc will generate
<div class="topic">
    <p><strong>Topic Title</strong></p>
    <p>Content in the topic</p>
</div>
Is there a supported way to change the HTML output? Say, <strong> to <mark>. Or adding another class to the parent <div>.
edit: I've assumed the formatting is the responsibility of the writer, but it's also possible it's decided when the AST is created.
This is what pandoc filters are for. Possibly the easiest way is to use Lua filters, as those are built into pandoc and don't require additional software to be installed.
The basic idea is that you'd match on an AST element created from the input, and produce raw output for your target format. So if all Strong elements were to be output as <mark> in HTML, you'd write
function Strong (element)
  -- the result will be the element's contents, which will no longer be 'strong'
  local result = element.content
  -- wrap contents in `<mark>` element
  result:insert(1, pandoc.RawInline('html', '<mark>'))
  result:insert(pandoc.RawInline('html', '</mark>'))
  return result
end
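To apply the filter, you'd save it to a file and pass it to pandoc with --lua-filter, something like pandoc --lua-filter=mark.lua input.rst -o output.html (the filenames here are just placeholders).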
You'd usually want to inspect pandoc's internal representation by running pandoc --to=native YOUR_FILE.rst. This makes it easier to write a filter.
There is a similar question on the pandoc-discuss mailing list; it deals with LaTeX output, but is also about handling of custom rst elements. You might find it instructional.
Nota bene: the above can be shortened by using a feature of pandoc that outputs spans and divs with a class of a known HTML element as that element:
function Strong (element)
  return pandoc.Span(element.content, {class = 'mark'})
end
But I think it's easier to look at the general case first.

How to open YAML file, change something and save it back in Go?

I need to change some values in a YAML file from Go code. In my case, I need to change the values.yaml file from a Helm chart. Since that file can change, I do not know the structure of the whole file in advance (for example, developers add new YAML sections to it in various projects). I just know what the section I want to change looks like.
I understand there is a YAML library for Go (https://github.com/go-yaml/yaml). On its own it will not do the job, because it assumes I know the structure of the file in advance. All the examples are something like:
1. create struct
2. unmarshal YAML to struct
3. change
4. marshal and save back
This does not work for me since I do not know the exact format of the file, hence I cannot do step 1 and create the struct.
This is part of YAML file I am trying to change:
image:
  repository: nginx
  tag: stable
  pullPolicy: IfNotPresent
I understand this can be done with the help of interface{}, but I do not understand how. Assuming that I understand structs and how to marshal/unmarshal YAML files, can someone provide code that will:
1. Load a YAML file that has at least 20 entries in it and whose structure is unknown
2. Change only one entry (in my case I want to change the tag value in the image section)
3. Save it back.
Thanks a lot!
Something like this should work:
data, err := ioutil.ReadFile(file)
// handle err ...
var v map[interface{}]interface{}
err = yaml.Unmarshal(data, &v)
// handle err ...
if img, ok := v["image"].(map[interface{}]interface{}); ok {
    img["tag"] = "somevalue"
}
The yaml library I use unmarshals into map[interface{}]interface{}. You need to add the necessary error checking, type assertions, etc.
When done, you can yaml.Marshal(v) and write the result.
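Putting it all together, here's a minimal end-to-end sketch assuming gopkg.in/yaml.v2 and the values.yaml file from the question; note that marshalling back will not preserve comments or the original key order:

package main

import (
    "io/ioutil"
    "log"

    yaml "gopkg.in/yaml.v2"
)

func main() {
    // Read the chart's values file.
    data, err := ioutil.ReadFile("values.yaml")
    if err != nil {
        log.Fatal(err)
    }

    // Unmarshal into a generic map, so no struct is needed;
    // yaml.v2 represents nested mappings as map[interface{}]interface{}.
    var v map[interface{}]interface{}
    if err := yaml.Unmarshal(data, &v); err != nil {
        log.Fatal(err)
    }

    // Change only the image.tag entry; everything else stays as parsed.
    if img, ok := v["image"].(map[interface{}]interface{}); ok {
        img["tag"] = "1.17.2" // example value
    }

    // Marshal the whole document and write it back.
    out, err := yaml.Marshal(v)
    if err != nil {
        log.Fatal(err)
    }
    if err := ioutil.WriteFile("values.yaml", out, 0644); err != nil {
        log.Fatal(err)
    }
}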

Convert xpath node back to html-markup in Go

import (
    "fmt"
    "log"

    "gopkg.in/xmlpath.v2"
)
...
path := xmlpath.MustCompile("//div[@id='23']")
tree, err := xmlpath.ParseHTML(reader)
if err != nil {
    log.Fatal("HTML parsing error, maybe not well-formed: ", err)
}
iter := path.Iter(tree)
for iter.Next() {
    fmt.Println(iter.Node().String()) // returns only the values of the text nodes
}
...
Is there a way to convert iter.Node() back to HTML markup like <div>...</div>? iter.Node().String() returns only the values of all inner text nodes. As far as I can see, the documentation of the xmlpath package does not offer such a function.
You are right - gopkg.in/xmlpath.v2 functions are limited to reading the content of nodes, and there are not many alternatives in Go for working with the DOM.
Among native Go libraries I can mention only goquery. It works only with HTML and does not support XPath, but it supports CSS selectors. Maybe that would be enough in your case.
If you really need to work with both HTML and XML via XPath, there is a libxml wrapper for Go called gokogiri. It supports all features of libxml, so you can get nodes, inner/outer HTML, attributes, and other things. I used it to extract text content in one service which is currently in production; it's a bit faster than PHP's DOMDocument. The only limitation is that I'm not sure whether it supports Go versions higher than 1.4.*. Oh, and installation on Windows is a bit tricky.
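To illustrate the goquery route: it can serialize a selection back to markup via goquery.OuterHtml. A minimal sketch, using an attribute selector in place of the XPath //div[@id='23']:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    const src = `<html><body><div id="23"><p>hello</p></div></body></html>`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(src))
    if err != nil {
        log.Fatal(err)
    }

    // CSS equivalent of the XPath //div[@id='23'].
    sel := doc.Find("div[id='23']")

    // OuterHtml serializes the selection back to markup, tags included.
    markup, err := goquery.OuterHtml(sel)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(markup) // <div id="23"><p>hello</p></div>
}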
I know this answer is late, but I still recommend these packages written in native Go: xquery and xpath. They support extracting data or evaluating values from XML/HTML documents using XPath expressions.

Parsing & replacing multiple links but not when one contains an other

I can't figure out how to (easily) prevent link (2) from replacing the beginning of link (1). I'd appreciate an answer in Ruby, but if you can figure out the logic, that's good too.
The output should be:
message = "For Last Minute rentals, please go to:
<span class='external_link' href-web='http://www.mydomain.com/thepage'>http://www.mydomain.com/thepage</span> (1)
For more information about our events, please visit our website:
<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"
But it is:
message = "For Last Minute rentals, please go to:
<span class='external_link' href-web='<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span>/thepage'><span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span>/thepage</span> (1)
For more information about our events, please visit our website:
<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"
Here's the code (edited: took out the spans):
message = "For Last Minute rentals, please go to:
http://www.mydomain.com/thepage
For more information about our events, please visit our website:
http://www.mydomain.com"

links_found = URI.extract(message, ['http', 'https'])
links_found.each do |link_found|
  message.gsub!(link_found, "<span class='external_link' href-web='#{link_found}'>#{link_found}</span>")
end
Thoughts?
I would guess that your problem is related to URI.extract. When it goes through message it's pulling all the instances of "http", which, for the first line, would be both "http" inside and outside the <span>.
To further clarify, links_found would be an array containing both the URL inside href-web='... and the one in the span's text (http...</span>). Since you're only passing link_found to gsub as the pattern to match, it will replace every match for every entry in the links_found array.
First, rule one, don't bother with string manipulation or regular expressions for anything but the most trivial things when dealing with HTML or XML. Doing otherwise is a sure recipe for madness.
Instead, save your sanity and go for a real parser. For Ruby I strongly suggest you look at Nokogiri only - it just works.
Consider this code:
require 'nokogiri'
message = "For Last Minute rentals, please go to:
<span class='external_link' href-web='http://www.mydomain.com/thepage'>http://www.mydomain.com/thepage</span> (1)
For more information about our events, please visit our website:
<span class='external_link' href-web='http://www.mydomain.com'>http://www.mydomain.com</span> (2)"
doc = Nokogiri::HTML(message)
external_spans = doc.search('span.external_link')
url1 = external_spans[0]['href-web'] # => "http://www.mydomain.com/thepage"
text1 = external_spans[0].text # => "http://www.mydomain.com/thepage"
url2 = external_spans[1]['href-web'] # => "http://www.mydomain.com"
text2 = external_spans[1].text # => "http://www.mydomain.com"
url1 and text1 are the URL and text from span 1, and url2 and text2 are from span 2, respectively.
I'm not sure what you want to do with them because, even after a more-than-cursory glance, I couldn't see a difference between your source and desired output. But once you have them, you're pretty much free to do anything: a parser like Nokogiri lets you retrieve information from the HTML or XML DOM, replace it, move things around, or even splice in new stuff.

Is there such a thing as a valid HTML5 fragment?

I obviously can't determine whether a fragment of HTML is valid without knowing what the rest of the document looks like (at a minimum, I would need a doctype in order to know which rules I'm validating against). But given the following HTML5 fragment:
<article><header></article>My header</header><p>My text</p></article>
I can certainly determine that it is invalid without seeing the rest of the document. So, is there such a thing as "provisionally valid" HTML, or "valid providing it fits into a certain place in a valid document"?
Is there more to it than the following pseudocode?
def is_valid_fragment(fragment):
    tmp = "<!doctype html><html><head><title></title></head><body>" + fragment + "</body></html>"
    return my_HTML5_validator.is_valid_html5_document(tmp)
You can certainly talk about an XML document being well-formed, and you can construct a document from any single element and its children. You could thus talk about singly-rooted XHTML5 fragments being well-formed. You could deal with a multiply-rooted fragment (like <img/><img/>) by treating it as a sequence of documents, or by wrapping it in some synthetic container element - since we're only talking about well-formedness, that would be okay.
However, HTML5 still allows the SGML self-closing tags, like <hr> and so on, whose self-closingness can only be determined by appeal to the doctype. For instance, <div><hr></div> is okay, but <div><tr></div> is not. If you were dealing with DOM nodes rather than text as input, this would be a nonissue, but if you have text, you'd need a parser which knows enough about HTML to be able to deal with those elements. Beyond that, though, some very simple rules, lifted directly from XML, would be enough to handle well-formedness.
If you wanted to go beyond well-formedness and look at some aspects of validity, I think you can still do that at the singly-rooted fragment level with XML. As the spec says:
An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.
A DTD can name any element as the root, and the mechanics then take care of checking the relationship between that element and its children, and their children and so on, and the various other constraints that make up validity.
Again, you can transfer that idea directly to HTML. I don't know how you deal with multiply-rooted fragments, though. And bear in mind that certain whole-document constraints (like IDs being unique) might hold inside the fragment, but not in an otherwise valid document once the fragment has been inserted into it.
Depending on what you intend to do with this verification, I think you should keep in mind that browsers are extremely forgiving regarding malformed HTML!
The invalid HTML string that you give in your example would work perfectly fine in (most if not all) browsers:
const serializedHTML = "<article><header></article>My header</header><p>My text</p></article>"
const range = document.createRange()
const fragment = range.createContextualFragment(serializedHTML)
console.log(fragment)
The content of the fragment defined in the snippet above would result in the following DOM tree:
<article>
    <header></header>
</article>
"My header"
<p>My text</p>
A crude method would be to check whether passing the fragment through the innerHTML of another element changes the text by doing something like the code below.
<html>
  <head>
    <script>
      function validateHTML(htmlFragment) {
        var testDiv = document.getElementById('testDiv')
        testDiv.innerHTML = htmlFragment
        var res = htmlFragment == testDiv.innerHTML
        testDiv.innerHTML = ""
        return res
      }
    </script>
  </head>
  <body>
    <div id="testDiv" style="display:none"></div>
    <textarea id="txtElem" onkeyup="this.style.backgroundColor = validateHTML(this.value) ? '' : '#f00'"></textarea>
  </body>
</html>
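The same round-trip idea can be sketched outside the browser with golang.org/x/net/html, whose ParseFragment applies the HTML5 parsing algorithm in a given context (here an assumed body element); if serializing the parsed nodes doesn't reproduce the input, the parser had to repair it:

package main

import (
    "bytes"
    "fmt"
    "strings"

    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

// normalize parses a fragment the way a browser parses body content,
// then serializes the resulting nodes back to HTML.
func normalize(fragment string) (string, error) {
    ctx := &html.Node{Type: html.ElementNode, Data: "body", DataAtom: atom.Body}
    nodes, err := html.ParseFragment(strings.NewReader(fragment), ctx)
    if err != nil {
        return "", err
    }
    var b bytes.Buffer
    for _, n := range nodes {
        if err := html.Render(&b, n); err != nil {
            return "", err
        }
    }
    return b.String(), nil
}

func main() {
    frag := "<article><header></article>My header</header><p>My text</p></article>"
    out, _ := normalize(frag)
    fmt.Println(out == frag) // false: the parser rewrote the bad nesting
    fmt.Println(out)
}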
You could check if it is well-formed.
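If well-formedness in the XML sense is enough (which, as the first answer notes, rules out unclosed void tags like <hr>), here is a sketch in Go using encoding/xml, wrapping the fragment in a synthetic root so that multiply-rooted fragments are accepted:

package main

import (
    "encoding/xml"
    "fmt"
    "io"
    "strings"
)

// wellFormed reports whether the fragment parses as well-formed XML.
// The synthetic <root> wrapper permits multiply-rooted fragments.
func wellFormed(fragment string) bool {
    dec := xml.NewDecoder(strings.NewReader("<root>" + fragment + "</root>"))
    for {
        _, err := dec.Token()
        if err == io.EOF {
            return true // clean end of input: well-formed
        }
        if err != nil {
            return false // any other error: not well-formed
        }
    }
}

func main() {
    fmt.Println(wellFormed("<p>My text</p>"))                       // true
    fmt.Println(wellFormed("<article><header></article></header>")) // false: bad nesting
}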
