Cast a Nokogiri::XML::Document to a Nokogiri::HTML::Document - ruby

I want to transform an XML document to HTML using XSL, tinker with it a little, then render it out. This is essentially what I'm doing:
source = Nokogiri::XML(File.read 'source.xml')
xsl = Nokogiri::XSLT(File.read 'transform.xsl')
transformed = xsl.transform(source)
html = Nokogiri::HTML(transformed.to_html)
html.title = 'Something computed'
Stylesheet#transform always returns an XML::Document, but I need an HTML::Document instance to use methods like title=.
The code above works, but exporting and re-parsing as HTML is just awful. Since the target is a subclass of the source, there must be a more effective way to perform the conversion.
How can I clean up this mess?
As a side question, Nokogiri has generally underwhelmed me with its handling of doctypes, unawareness of <meta charset= etc... does anyone know of a less auto-magic library with similar capabilities?
Many thanks ;)

HTML::Document extends XML::Document, but the individual nodes in an HTML document are just plain XML::Nodes, i.e. there aren't any HTML::Nodes. This suggests a way of converting an XML document to HTML by creating a new empty HTML::Document and setting its root to that of the XML document:
html = Nokogiri::HTML::Document.new
html.root = transformed.root
The new document has the HTML methods like title= and meta_encoding= available, and when serializing it produces an HTML document rather than XML: it adds an HTML doctype, correctly uses empty tags like <br>, writes minimized attributes where appropriate (e.g. <input type="checkbox" selected>) and doesn't escape things like > in <script> blocks.
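Putting the question's snippet and this answer together, the whole flow would look roughly like this (a sketch only; 'out.html' is just an illustrative output filename):
require 'nokogiri'

source      = Nokogiri::XML(File.read 'source.xml')
xsl         = Nokogiri::XSLT(File.read 'transform.xsl')
transformed = xsl.transform(source)        # still a Nokogiri::XML::Document

html = Nokogiri::HTML::Document.new        # empty HTML document
html.root  = transformed.root              # adopt the transformed tree
html.title = 'Something computed'          # HTML-only helpers now work
File.write('out.html', html.to_html)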

Related

Passing JSON as HTML element text

Would there be bad consequences from transporting JSON in HTML like this:
<div id="json" style="display: none;">{"foo": "bar"}</div>
assuming HTML chars such as < are escaped as &lt; in the element text?
The JSON could be strictly parsed:
var blah = $.parseJSON($('#json').html())
in a try/catch statement, for example. The rationale is to enable passing of JSON in Ajax'd HTML responses, when script tags are being stripped and not executed. An example would be Ajax requests made using the jQuery .load() special selector syntax:
$('#here').load('some.html #fragment')
...which ditches all script tags and thus prevents the use of:
<script>var blah = {"foo":"bar"}</script>
I've seen JSON being passed around in HTML attributes, and I'd guess this is equivalent - w.r.t. weirdness, security, etc - but is far less readable due to all the additional quote-escaping.
The natural way of passing JS data in HTML is through JavaScript code (if it is part of the actual JavaScript code, as in the case of initial values/configuration) or by data- HTML5 attributes (whenever JS code is not necessary; always when data needs to be somehow attached to DOM elements).
In your example this would probably be best:
<div id="json" style="display: none;"
data-something="{&quot;foo&quot;:&quot;bar&quot;}">
</div>
but it would be better to reorganize your data to actually follow the HTML structure:
<div class="profile-container"
data-profile="{&quot;name&quot;:&quot;John Doe&quot;,&quot;id&quot;:123}">
... profile 123 ...
</div>
<div class="profile-container"
data-profile="{&quot;name&quot;:&quot;Jane Doe&quot;,&quot;id&quot;:321}">
... profile 321 ...
</div>
(Quoting should be done server-side, e.g. using PHP's htmlspecialchars(...) or Python's cgi.escape(..., True).)
And then you can obtain the data in one of multiple ways, e.g. using jQuery's .data() method.
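For instance, in Ruby (to stay in the language of this page) the server-side escaping could look roughly like this; CGI.escapeHTML plays the same role as PHP's htmlspecialchars, and the profile hash is made up for illustration:
require 'json'
require 'cgi'

profile = { name: 'John Doe', id: 123 }     # illustrative data
json    = JSON.generate(profile)            # => {"name":"John Doe","id":123}
escaped = CGI.escapeHTML(json)              # quotes become &quot;, etc.
puts %(<div class="profile-container" data-profile="#{escaped}">... profile #{profile[:id]} ...</div>)
On the client side, jQuery's .data() method mentioned above will generally hand that attribute back as an already-parsed object.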
EDIT:
Yes, your approach of embedding JSON as the content of an HTML tag and hiding it with CSS has gotchas. As I said, if you want to pass data in HTML, the only "best practice" way is to attach it to an HTML element (you are kind of doing that anyway, but you use CSS to hide it, while you could use existing solutions for passing JSON/data without affecting clients that might override your styles). A proof of one of the disadvantages is here: http://jsfiddle.net/NY7Bs/ (the data is passed both ways, but one simple external style overrides your inline styles and shows the content, not to mention the influence on the semantics of your document).
Why not simply use the .ajax() function then? You would get only the string with the JSON, and could then parse it as you suggested.

Handlebars template with "div" tag instead of "script"

Actually the question is in the subj...
Is it possible to make the Handlebars template framework recognize templates within a div tag rather than a script tag?
For example I would like to create template with this markup:
<style>
div.text-x-handlebars-template {display:none;}
</style>
<div class="text-x-handlebars-template">
<h2>I'm template</h2>
<p>{{welcomeMessage}}</p>
</div>
Yes you can put your templates in <div>s rather than <script>s, for example:
http://jsfiddle.net/ambiguous/RucqP/
However, doing so is fraught with danger. If you put your template inside a <div>, the browser will interpret it as HTML before you've filled it in; so, if your template's HTML isn't valid until after it has been filled in, the browser may attempt to correct it and make a mess of things. Also, if you have id attributes in your templates, you will end up with duplicate ids (one in the template <div> and a repeat in the filled-in template that you put in the DOM), and that will cause all sorts of strange and interesting bugs. The browser will also try to download any images inside templates kept in a <div>; this may or may not be a problem (it might even be desirable, but probably not if the image uses a template variable in its src attribute).
Basically, you can do it, but you shouldn't; put your templates in <script id="..." type="text/x-handlebars-template"> elements instead.

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert html to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc.
The input is pretty simple, usually well-formatted html (not entire documents, just a bunch of content, usually with no anchors or images).
I could put together a couple of regexes that get me 80% there, but figured there might be some existing solutions with more intelligence.
First, don't try to use regex for this. The odds are really good you'll come up with a fragile/brittle solution that will break with changes in the HTML or will be very hard to manage and maintain.
You can get part of the way there very quickly using Nokogiri to parse the HTML and extract the text:
require 'nokogiri'
html = '
<html>
<body>
<p>This is
some text.</p>
<p>This is some more text.</p>
<pre>
This is
preformatted
text.
</pre>
</body>
</html>
'
doc = Nokogiri::HTML(html)
puts doc.text
>> This is
>> some text.
>> This is some more text.
>>
>> This is
>> preformatted
>> text.
The reason this works is Nokogiri is returning the text nodes, which are basically the whitespace surrounding the tags, along with the text contained in the tags. If you do a pre-flight cleanup of the HTML using tidy you can sometimes get a lot nicer output.
The problem is when you compare the output of a parser, or any means of looking at the HTML, with what a browser displays. The browser is concerned with presenting the HTML in as pleasing a way as possible, ignoring the fact that the HTML can be horribly malformed and broken. The parser is not designed to do that.
You can massage the HTML before extracting the content to remove extraneous line-breaks like "\n" and "\r", then replace <br> tags with line-breaks. There are many questions here on SO explaining how to replace tags with something else; I think the Nokogiri site also has that as one of its tutorials.
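As a rough sketch of that massaging step (reusing the html string from the snippet above), each <br> can be swapped for a newline text node before grabbing the text:
doc = Nokogiri::HTML(html)
# Swap every <br> for a literal line-break so it survives the text extraction.
doc.css('br').each { |br| br.replace(Nokogiri::XML::Text.new("\n", doc)) }
puts doc.text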
If you really want to do it right, you'll need to figure out what you want to do for <li> tags inside <ul> and <ol> tags, along with tables.
An alternate attack method would be to capture the output of one of the text browsers like lynx. Several years ago I needed to do text processing for keywords on websites that didn't use Meta-Keyword tags, and found one of the text-browsers that let me grab the rendered output that way. I don't have the source available so I can't check to see which one it was.
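If you go the text-browser route, one rough way to capture the rendered output from Ruby (a sketch, assuming the lynx text browser is installed with its -stdin and -dump options) is:
require 'open3'

html = '<p>This is <b>some</b> text.</p>'   # illustrative fragment
# Feed the HTML to lynx on stdin and capture its plain-text rendering.
text, _status = Open3.capture2('lynx', '-stdin', '-dump', stdin_data: html)
puts text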

IE8 & FF XHTML error or badly formed span?

I recently found a strange occurrence in IE8 & FF.
The designers were using JS to dynamically create some span tags for layout (they were placing rounded-corner graphics on some tabs). The XHTML, in JS, looked like this: <span class="leftcorner" /><span class="rightcorner" /> and worked perfectly!
As we all know, dynamically rendering elements in JS can be quite processor-intensive, so I moved the elements from JS into the page source, exactly as above.
... and it didn't work... not only did it not work, it crashed IE8. The fix was simple: close the span explicitly, i.e. <span class="leftcorner"></span>
I am a bit confused by this.
Firstly, as far as I am aware, <span class="leftcorner" /> is perfectly valid XHTML!
Secondly it works dynamically, but not in XHTML?!?!?
Can anyone shed any light on this or is it simply another odd occurrence of browsers?
The major browsers only support a small subset of self-closing tags. (See this answer for a complete list.)
Depending on how you were creating the elements in JS, the JavaScript engine probably created a valid element to place in the DOM.
I had a similar problem with <a> tags in IE.
The problem was that my links looked like this (it was an icon set with CSS, so I didn't need any text in them):
<a href="link" class="icon edit" />
Unfortunately, in IE these links were not displayed at all. They have to be in <a href="link" class="icon edit">&nbsp;</a> format (leaving the text empty didn't work either, so I put &nbsp; there). So what I did was add a few extra JS lines to fix it, as I didn't want to change all my HTML just for this browser (PS: I'm using jQuery for my JS).
if ($.browser.msie) {
$('a.icon').html('&nbsp;');
}
IE in particular does not support XHTML. That is, it will never apply proper XML parsing rules to a document - it will treat it as HTML even with proper DOCTYPE and all. XHTML is not always valid SGML, however. In some cases (such as <br/>) IE can figure it out because it's prepared to parse tagsoup, and not just valid SGML. However, in other cases, the same "tagsoup" behavior means that it won't treat /> as self-closing tag terminator.
In general, my advice is to just use HTML 4.01 Strict. That way you know exactly what to expect. And there's little point in feeding XHTML to browsers when they're treating it as HTML anyway...
I think that one of the answers to "Is writing self closing tags for elements not traditionally empty bad practice?" will answer your question.
XHTML is only XHTML if it is served as application/xhtml+xml — otherwise, at least as far as browsers are concerned, it is HTML and treated as tag soup.
As a result, <span /> means "A span start tag" and not "A complete span element". (Technically it should mean "A span start tag and a greater than sign", but that is another story).
The XHTML spec tells you what you need to do to get your XHTML to parse as HTML.
One of the rules is "For non-empty elements, end tags are required". The list of elements includes a quick reference to which are empty and which are not.

FireFox script tag error [duplicate]

What is the reason browsers do not correctly recognize:
<script src="foobar.js" /> <!-- self-closing script element -->
Only this is recognized:
<script src="foobar.js"></script>
Does this break the concept of XHTML support?
Note: This statement is correct at least for all IE (6-8 beta 2).
The non-normative appendix 'HTML Compatibility Guidelines' of the XHTML 1 specification says:
C.3. Element Minimization and Empty Element Content
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
XHTML DTD specifies script elements as:
<!-- script statements, which may include CDATA sections -->
<!ELEMENT script (#PCDATA)>
To add to what Brad and squadette have said, the self-closing XML syntax <script /> actually is correct XML, but for it to work in practice, your web server also needs to send your documents as properly formed XML with an XML mimetype like application/xhtml+xml in the HTTP Content-Type header (and not as text/html).
However, sending an XML mimetype will cause your pages not to be parsed by IE7, which only likes text/html.
From the W3C:
In summary, 'application/xhtml+xml' SHOULD be used for XHTML Family documents, and the use of 'text/html' SHOULD be limited to HTML-compatible XHTML 1.0 documents. 'application/xml' and 'text/xml' MAY also be used, but whenever appropriate, 'application/xhtml+xml' SHOULD be used rather than those generic XML media types.
I puzzled over this a few months ago, and the only workable (compatible with FF3+ and IE7) solution was to use the old <script></script> syntax with text/html (HTML syntax + HTML mimetype).
If your server sends the text/html type in its HTTP headers, even with otherwise properly formed XHTML documents, FF3+ will use its HTML rendering mode which means that <script /> will not work (this is a change, Firefox was previously less strict).
This will happen regardless of any fiddling with http-equiv meta elements, the XML prolog or the doctype inside your document -- Firefox branches once it gets the text/html header: that header determines whether the HTML or the XML parser looks inside the document, and the HTML parser does not understand <script />.
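If you do want the browser to use its XML parser, the decisive step is that header. A minimal Ruby sketch (a hypothetical config.ru for Rack; page.xhtml stands in for your document):
# config.ru -- run with `rackup`; serves the document with the XML mimetype
run lambda { |env|
  [200, { 'Content-Type' => 'application/xhtml+xml' }, [File.read('page.xhtml')]]
}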
Others have answered "how" and quoted spec. Here is the real story of "why no <script/>", after many hours digging into bug reports and mailing lists.
HTML 4
HTML 4 is based on SGML.
SGML has some shorttags, such as <BR//, <B>text</>, <B/text/, or <OL<LI>item</LI</OL>.
XML takes the first form, redefines the ending as ">" (SGML is flexible), so that it becomes <BR/>.
However, HTML did not redefine it, so <SCRIPT/> should mean <SCRIPT>>.
(Yes, the '>' should be part of content, and the tag is still not closed.)
Obviously, this is incompatible with XHTML and will break many sites (by the time browsers were mature enough to care about this), so nobody implemented shorttags and the specification advises against them.
Effectively, all 'working' self-ended tags are tags with a prohibited end tag, handled by technically non-conformant parsers, and are in fact invalid.
It was W3C which came up with this hack to help transitioning to XHTML by making it HTML-compatible.
And <script>'s end tag is not prohibited.
"Self-ending" tag is a hack in HTML 4 and is meaningless.
HTML 5
HTML5 has five types of tags and only 'void' and 'foreign' tags are allowed to be self-closing.
Because <script> is not void (it may have content) and is not foreign (like MathML or SVG), <script> cannot be self-closed, regardless of how you use it.
But why? Can't they regard it as foreign, make a special case, or something?
HTML 5 aims to be backward-compatible with implementations of HTML 4 and XHTML 1.
It is not based on SGML or XML; its syntax is mainly concerned with documenting and uniting the implementations.
(This is why <br/> <hr/> etc. are valid HTML 5 despite being invalid HTML4.)
Self-closing <script> is one of the tags where implementations used to differ.
It used to work in Chrome, Safari, and Opera; to my knowledge it never worked in Internet Explorer or Firefox.
This was discussed when HTML 5 was being drafted and got rejected because it breaks browser compatibility.
Webpages that self-close script tag may not render correctly (if at all) in old browsers.
There were other proposals, but they can't solve the compatibility problem either.
After the draft was released, WebKit updated the parser to be in conformance.
Self-closing <script> does not happen in HTML 5 because of backward compatibility to HTML 4 and XHTML 1.
XHTML 1 / XHTML 5
When really served as XHTML, <script/> is really closed, as other answers have stated.
Except that the spec says it should have worked when served as HTML:
XHTML Documents ... may be labeled with the Internet Media Type "text/html" [RFC2854], as they are compatible with most HTML browsers.
So, what happened?
People asked Mozilla to let Firefox parse conforming documents as XHTML regardless of the specified content header (known as content sniffing).
This would have allowed self-closing scripts, and content sniffing was necessary anyway because web hosters were not mature enough to serve the correct header; IE was good at it.
If the first browser war hadn't ended with IE 6, XHTML might have been on the list, too. But it did end. And IE 6 had a problem with XHTML.
In fact IE did not support the correct MIME type at all, forcing everyone to use text/html for XHTML because IE held major market share for a whole decade.
Also, content sniffing can be really bad, and people are saying it should be stopped.
Finally, it turns out that the W3C didn't mean XHTML to be sniffable: the document is both HTML and XHTML, and the Content-Type header rules.
One can say they were standing firm on "just follow our spec" and ignoring what was practical. A mistake that continued into later XHTML versions.
Anyway, this decision settled the matter for Firefox.
It was 7 years before Chrome was born; there were no other significant browsers. Thus it was decided.
Specifying the doctype alone does not trigger XML parsing, because browsers follow the specifications.
In case anyone's curious, the ultimate reason is that HTML was originally a dialect of SGML, which is XML's weird older brother. In SGML-land, elements can be specified in the DTD as either self-closing (e.g. BR, HR, INPUT), implicitly closeable (e.g. P, LI, TD), or explicitly closeable (e.g. TABLE, DIV, SCRIPT). XML, of course, has no concept of this.
The tag-soup parsers used by modern browsers evolved out of this legacy, although their parsing model isn't pure SGML anymore. And of course, your carefully-crafted XHTML is being treated as badly-written SGML-inspired tag-soup unless you send it with an XML mime type. This is also why...
<p><div>hello</div></p>
...gets interpreted by the browser as:
<p></p><div>hello</div><p></p>
...which is the recipe for a lovely obscure bug that can throw you into fits as you try to code against the DOM.
Internet Explorer 8 and earlier do not support XHTML parsing. Even if you use an XML declaration and/or an XHTML doctype, old IE still parses the document as plain HTML. And in plain HTML, the self-closing syntax is not supported: the trailing slash is just ignored, and you have to use an explicit closing tag.
Even browsers with support for XHTML parsing, such as IE 9 and later, will still parse the document as HTML unless you serve the document with an XML content type. But in that case old IE will not display the document at all!
The people above have already pretty much explained the issue, but one thing that might make things clear is that, though people use <br/> and such all the time in HTML documents, any / in such a position is basically ignored, and only used when trying to make something both parseable as XML and HTML. Try <p/>foo</p>, for example, and you get a regular paragraph.
The self-closing script tag won't work, because the script tag can contain inline code, and HTML is not smart enough to turn that feature on or off based on the presence of an attribute.
On the other hand, HTML does have an excellent tag for including references to outside resources: the <link> tag, and it can be self-closing. It's already used to include stylesheets, RSS and Atom feeds, canonical URIs, and all sorts of other goodies. Why not JavaScript?
If you want the script tag to be self-closing you can't do that, as I said, but there is an alternative, though not a smart one. You can use the self-closing link tag and link to your JavaScript by giving it a type of text/javascript and a rel of script, something like below:
<link type="text/javascript" rel="script" href="/path/to/javascript" />
Unlike XML and XHTML, HTML has no knowledge of the self-closing syntax. Browsers that interpret XHTML as HTML don't know that the / character indicates that the tag should be self-closing; instead they interpret it like an empty attribute and the parser still thinks the tag is 'open'.
Just as <script defer> is treated as <script defer="defer">, <script /> is treated as <script /="/">.
Internet Explorer 8 and older don't support the proper MIME type for XHTML, application/xhtml+xml. If you're serving XHTML as text/html, which you have to for these older versions of Internet Explorer to do anything, it will be interpreted as HTML 4.01. You can only use the short syntax with any element that permits the closing tag to be omitted. See the HTML 4.01 Specification.
The XML 'short form' is interpreted as an attribute named /, which (because there is no equals sign) is interpreted as having an implicit value of "/". This is strictly wrong in HTML 4.01 - undeclared attributes are not permitted - but browsers will ignore it.
IE9 and later support XHTML 5 served with application/xhtml+xml.
That's because the SCRIPT tag is not a VOID ELEMENT.
In an HTML document, VOID ELEMENTS do not need a "closing tag" at all!
In XHTML, everything is generic and therefore everything needs termination, e.g. a "closing tag", including br, a simple line-break, written as <br></br> or with its shorthand <br />.
However, a script element is never a void or parametric element, because the script tag, before anything else, is a browser instruction, not a data description declaration.
Principally, a semantic termination instruction, e.g. a "closing tag", is only needed for processing instructions whose semantics cannot be terminated by a succeeding tag. For instance:
<H1> semantics cannot be terminated by a following <P>, because <P> doesn't carry enough of its own semantics to override and therefore terminate the previous H1 instruction set. Although it will be able to break the stream into a new paragraph line, it is not "strong enough" to override the present font size & style line-height pouring down the stream, i.e. leaking from H1 (because P doesn't have it).
This is how and why the "/" (termination) signalling has been invented.
A generic, no-description termination tag like < /> would have sufficed for any single fall off the encountered cascade, e.g.: <H1>Title< />, but that's not always the case, because we also want to be capable of "nesting", multiple intermediary tagging of the stream: splitting it into torrents before wrapping / falling onto another cascade. As a consequence, a generic terminator such as < /> would not be able to determine the target of the property to terminate. For example: <b>bold <i>bold-italic < /> italic </>normal would undoubtedly fail to get our intention right and would most probably be interpreted as bold bold-italic bold normal.
This is how the notion of a wrapper, i.e. a container, was born. (These notions are so similar that it is impossible to discern them, and sometimes the same element may have both roles. <H1> is both wrapper and container at the same time, whereas <B> is only a semantic wrapper.) We'll need a plain, no-semantics container. And of course the invention of the DIV element came by.
The DIV element is actually a 2BR-container. Of course, the coming of CSS made the whole situation weirder than it would otherwise have been and caused great confusion with many great consequences - indirectly!
Because with CSS you could easily override the native before-and-after BR behavior of the newly invented DIV, it is often referred to as a "do nothing" container. Which is, naturally, wrong! DIVs are block elements and will natively break the line of the stream both before and after the end signalling. Soon the web started suffering from page DIV-itis, and most pages still do.
The coming of CSS, with its capability to fully override and completely redefine the native behavior of any HTML tag, somehow managed to confuse and blur the whole meaning of HTML's existence...
Suddenly all HTML tags appeared obsolete; they were defaced, stripped of all their original meaning, identity and purpose. Somehow you'd gain the impression that they were no longer needed: a single container-wrapper tag would suffice for all data presentation, just add the required attributes. Why not have meaningful tags instead? Invent tag names as you go and let CSS bother with the rest.
This is how XHTML was born, and of course the great blunder, paid for so dearly by newcomers, and a distorted vision of what is what and what the damn purpose of it all is. The W3C went from World Wide Web to What Went Wrong, Comrades?!!
The purpose of HTML is to stream meaningful data to the human recipient.
To deliver Information.
The formal part is there to only assist the clarity of information delivery.
XHTML doesn't give the slightest consideration to the information; to it, the information is absolutely irrelevant.
The most important thing in the matter is to know and understand that XHTML is not just a version of some extended HTML; XHTML is a completely different beast, from the ground up, and therefore it is wise to keep them separate.
The simple modern answer is that the tag is denoted as mandatory that way:
"Tag omission: None, both the starting and ending tag are mandatory."
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script
The difference between 'true XHTML', 'faux XHTML' and 'ordinary HTML', as well as the importance of the server-sent MIME type, has already been described well here.
If you want to try it out right now, here is a simple editable snippet with live preview, including a self-closed script tag (see <script src="data:text/javascript,/*functionality*/" />) and an XML entity (unrelated, see &x;).
As you can see, depending on the MIME type of the embedding document, the data-URI JavaScript is either executed and the subsequent text displayed (in application/xhtml+xml mode), or not executed and the subsequent text 'devoured' by the script (in text/html mode).
div { display: flex; }
div + div {flex-direction: column; }
<div>Mime type: <label><input type="radio" onchange="t.onkeyup()" id="x" checked name="mime"> application/xhtml+xml</label>
<label><input type="radio" onchange="t.onkeyup()" name="mime"> text/html</label></div>
<div><textarea id="t" rows="4"
onkeyup="i.src='data:'+(x.checked?'application/xhtml+xml':'text/html')+','+encodeURIComponent(t.value)"
><?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[<!ENTITY x "true XHTML">]>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p>
<span id="greet" swapto="Hello">Hell, NO :(</span> &x;.
<script src="data:text/javascript,(g=document.getElementById('greet')).innerText=g.getAttribute('swapto')" />
Nice to meet you!
<!--
Previous text node and all further content falls into SCRIPT element content in text/html mode, so is not rendered. Because no end script tag is found, no script runs in text/html
-->
</p>
</body>
</html></textarea>
<iframe id="i" height="80"></iframe>
<script>t.onkeyup()</script>
</div>
You should see Hello, true XHTML. Nice to meet you! below the textarea.
For incapable browsers you can copy the content of the textarea and save it as a file with an .xhtml (or .xht) extension (thanks Alek for this hint).
