What does <! mean in SHTML? - shtml

I am modifying some SHTML files and have run across the following
<! IMG height=1 src="x.gif" width=1 border=0>
I have never seen the <! without the --# following it to denote a server side include.
Is this valid SHTML? If so, what does it do.
I tried searching, but you can imagine how frustrating it gets to do a search on <! ... LOL

Related

Closing anchor tags with HtmlAgilityPack

I am using the HtmlAgilityPack to scrape crummy html and get links, raw text, etc. I'm running into a few pages that have inconsistently closed <a> tags, like this:
<html>
<head></head>
<body>
<a href=...>Here's a great link! <a href=...>Here's another one!</a>
Here's some unrelated text.
</body></html>
HAP parses this, and helpfully closes the open <a> tag, but only at the very end of the document:
<html>
<head></head>
<body>
Here's a great link! <a href="...">Here's another one!
Here's some unrelated text.
</a></body></html>
In practice this means that the InnerText of any unclosed link contains all text from the rest of the page, which gets exciting when parsing a page that may contain thousands of unclosed tags and megabytes of text.
So, how can I make HAP close those tags immediately, ideally putting the close just before the next open so that there is never any overlap for an <a>? I've played around with OptionFixNestedTags and OptionAutoCloseOnEnd with no luck, and I've found advice on how to allow overlap, but I'm drawing a blank on actually fixing it.

scrapy: Remove elements from an xpath selector

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?
I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack overflow has not preserved the indenting, so I do not know where the end of the first div element is, but I'm going to guess it ends before the text.
In that case you might want
//div[#id = 'another-easy-id']/following:node()
[not(preceding::div[#id = 'one-more-easy-id']) and not(#id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[#id='easy-id']/div[#id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[#id='another-easy-id']]

#font-face unicode-range attribute

In some html documents I'm using webfonts for only a couple of words. Performance-wise loading a complete font-file seems wasteful. This is where the unicode-range parameter of the #font-face declaration comes in:
http://www.w3.org/TR/css3-fonts/#unicode-range-desc
With it I supposedly can define what characters of the font-file to load, thus improving performance greatly.
But I just can't get it to work. And the odd thing is that it diplays all the characters in firefox, and it fails to load the font in safari just if I include the unicode-range parameter in my declaration.
Any help would be appreciated, below is the html I was testing it with:
<!doctype html>
<html lang="en">
<head>
<style text="text/css">
#font-face {
font-family: 'dream';
src: url(Fonts/Digital-dream/DIGITALDREAM.ttf) format("truetype");
unicode-range: U+FF21;
}
*{
font-family:dream;
font-weight:normal;
}
</style>
</head>
<body>
<p>ASDWEWQDSCF</p>
</body>
</html>
You are misunderstanding the purpose of that value. From that page:
This descriptor defines the range of Unicode characters supported by a given font
So this isn't the glyphs (or characters) to download, this is actually telling the browser what characters are in the font so that the browser can determine if its even worth downloading the font in the first place. If the text being styled isn't in the given unicode-range then the font won't be downloaded.
Example XIII on the page you reference shows a great example. Imagine 3 #font-face rules that share the same font-family attribute. The first rule, however, points to a giant 4.5MB TTF file that has every possible glyph needed. The second rule lists a much smaller 1.2MB TTF but says to only use it only if all of the characters fit into the Japanese glyph range. The third rule lists a very small 190KB file that can be download if the text being styling is only roman.

IE8 & FF XHTML error or badly formed span?

I recently have found a strange occurrence in IE8 & FF.
The designers where using js to dynamically create some span tags for layout (they were placing rounded corner graphics on some tabs). Now the xhtml, in js, looked like this: <span class=”leftcorner” /><span class=”rightcorner” /> and worked perfectly!
As we all know dynamically rendering elements in js can be quite processor intensive so I moved the elements from js into the page source, exactly as above.
... and it didn’t work... not only didn’t it work, it crashes IE8.The fix was simple, put the close span in ie: <span class=”leftcorner”></span>
I am a bit confused by this.
Firstly as far as I am aware <span class=”leftcorner” /> is perfectly valid XHTML!
Secondly it works dynamically, but not in XHTML?!?!?
Can anyone shed any light on this or is it simply another odd occurrence of browsers?
The major browsers only support a small subset of self-closing tags. (See this answer for a complete list.)
Depending on how you were creating the elements in JS, the JavaScript engine probably created a valid element to place in the DOM.
I had similar problem with a tags in IE.
The problem was my links looked like that (it was an icon set with the css, so I didn't need the text in it:
<a href="link" class="icon edit" />
Unfortunately in IE these links were not displayed at all. They have to be in
format (leaving empty text didn't work as well so I put there). So what I did is I add an few extra JS lines to fix it as I didn't want to change all my HTML just for this browser (ps. I'm using jQuery for my JS).
if ($.browser.msie) {
$('a.icon').html('&nbsp');
}
IE in particular does not support XHTML. That is, it will never apply proper XML parsing rules to a document - it will treat it as HTML even with proper DOCTYPE and all. XHTML is not always valid SGML, however. In some cases (such as <br/>) IE can figure it out because it's prepared to parse tagsoup, and not just valid SGML. However, in other cases, the same "tagsoup" behavior means that it won't treat /> as self-closing tag terminator.
In general, my advice is to just use HTML 4.01 Strict. That way you know exactly what to expect. And there's little point in feeding XHTML to browsers when they're treating it as HTML anyway...
See I think that one of the answers to Is writing self closing tags for elements not traditionally empty bad practice? will answer your question.
XHTML is only XHTML if it is served as application/xhtml+xml — otherwise, at least as far as browsers are concerned, it is HTML and treated as tag soup.
As a result, <span /> means "A span start tag" and not "A complete span element". (Technically it should mean "A span start tag and a greater than sign", but that is another story).
The XHTML spec tells you what you need to do to get your XHTML to parse as HTML.
One of the rules is "For non-empty elements, end tags are required". The list of elements includes a quick reference to which are empty and which are not.

FireFox script tag error [duplicate]

What is the reason browsers do not correctly recognize:
<script src="foobar.js" /> <!-- self-closing script element -->
Only this is recognized:
<script src="foobar.js"></script>
Does this break the concept of XHTML support?
Note: This statement is correct at least for all IE (6-8 beta 2).
The non-normative appendix ‘HTML Compatibility Guidelines’ of the XHTML 1 specification says:
С.3. Element Minimization and Empty Element Content
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
XHTML DTD specifies script elements as:
<!-- script statements, which may include CDATA sections -->
<!ELEMENT script (#PCDATA)>
To add to what Brad and squadette have said, the self-closing XML syntax <script /> actually is correct XML, but for it to work in practice, your web server also needs to send your documents as properly formed XML with an XML mimetype like application/xhtml+xml in the HTTP Content-Type header (and not as text/html).
However, sending an XML mimetype will cause your pages not to be parsed by IE7, which only likes text/html.
From w3:
In summary, 'application/xhtml+xml'
SHOULD be used for XHTML Family
documents, and the use of 'text/html'
SHOULD be limited to HTML-compatible
XHTML 1.0 documents. 'application/xml'
and 'text/xml' MAY also be used, but
whenever appropriate,
'application/xhtml+xml' SHOULD be used
rather than those generic XML media
types.
I puzzled over this a few months ago, and the only workable (compatible with FF3+ and IE7) solution was to use the old <script></script> syntax with text/html (HTML syntax + HTML mimetype).
If your server sends the text/html type in its HTTP headers, even with otherwise properly formed XHTML documents, FF3+ will use its HTML rendering mode which means that <script /> will not work (this is a change, Firefox was previously less strict).
This will happen regardless of any fiddling with http-equiv meta elements, the XML prolog or doctype inside your document -- Firefox branches once it gets the text/html header, that determines whether the HTML or XML parser looks inside the document, and the HTML parser does not understand <script />.
Others have answered "how" and quoted spec. Here is the real story of "why no <script/>", after many hours digging into bug reports and mailing lists.
HTML 4
HTML 4 is based on SGML.
SGML has some shorttags, such as <BR//, <B>text</>, <B/text/, or <OL<LI>item</LI</OL>.
XML takes the first form, redefines the ending as ">" (SGML is flexible), so that it becomes <BR/>.
However, HTML did not redfine, so <SCRIPT/> should mean <SCRIPT>>.
(Yes, the '>' should be part of content, and the tag is still not closed.)
Obviously, this is incompatible with XHTML and will break many sites (by the time browsers were mature enough to care about this), so nobody implemented shorttags and the specification advises against them.
Effectively, all 'working' self-ended tags are tags with prohibited end tag on technically non-conformant parsers and are in fact invalid.
It was W3C which came up with this hack to help transitioning to XHTML by making it HTML-compatible.
And <script>'s end tag is not prohibited.
"Self-ending" tag is a hack in HTML 4 and is meaningless.
HTML 5
HTML5 has five types of tags and only 'void' and 'foreign' tags are allowed to be self-closing.
Because <script> is not void (it may have content) and is not foreign (like MathML or SVG), <script> cannot be self-closed, regardless of how you use it.
But why? Can't they regard it as foreign, make special case, or something?
HTML 5 aims to be backward-compatible with implementations of HTML 4 and XHTML 1.
It is not based on SGML or XML; its syntax is mainly concerned with documenting and uniting the implementations.
(This is why <br/> <hr/> etc. are valid HTML 5 despite being invalid HTML4.)
Self-closing <script> is one of the tags where implementations used to differ.
It used to work in Chrome, Safari, and Opera; to my knowledge it never worked in Internet Explorer or Firefox.
This was discussed when HTML 5 was being drafted and got rejected because it breaks browser compatibility.
Webpages that self-close script tag may not render correctly (if at all) in old browsers.
There were other proposals, but they can't solve the compatibility problem either.
After the draft was released, WebKit updated the parser to be in conformance.
Self-closing <script> does not happen in HTML 5 because of backward compatibility to HTML 4 and XHTML 1.
XHTML 1 / XHTML 5
When really served as XHTML, <script/> is really closed, as other answers have stated.
Except that the spec says it should have worked when served as HTML:
XHTML Documents ... may be labeled with the Internet Media Type "text/html" [RFC2854], as they are compatible with most HTML browsers.
So, what happened?
People asked Mozilla to let Firefox parse conforming documents as XHTML regardless of the specified content header (known as content sniffing).
This would have allowed self-closing scripts, and content sniffing was necessary anyway because web hosters were not mature enough to serve the correct header; IE was good at it.
If the first browser war didn't end with IE 6, XHTML may have been on the list, too. But it did end. And IE 6 has a problem with XHTML.
In fact IE did not support the correct MIME type at all, forcing everyone to use text/html for XHTML because IE held major market share for a whole decade.
And also content sniffing can be really bad and people are saying it should be stopped.
Finally, it turns out that the W3C didn't mean XHTML to be sniffable: the document is both, HTML and XHTML, and Content-Type rules.
One can say they were standing firm on "just follow our spec" and ignoring what was practical. A mistake that continued into later XHTML versions.
Anyway, this decision settled the matter for Firefox.
It was 7 years before Chrome was born; there were no other significant browser. Thus it was decided.
Specifying the doctype alone does not trigger XML parsing because of following specifications.
In case anyone's curious, the ultimate reason is that HTML was originally a dialect of SGML, which is XML's weird older brother. In SGML-land, elements can be specified in the DTD as either self-closing (e.g. BR, HR, INPUT), implicitly closeable (e.g. P, LI, TD), or explicitly closeable (e.g. TABLE, DIV, SCRIPT). XML, of course, has no concept of this.
The tag-soup parsers used by modern browsers evolved out of this legacy, although their parsing model isn't pure SGML anymore. And of course, your carefully-crafted XHTML is being treated as badly-written SGML-inspired tag-soup unless you send it with an XML mime type. This is also why...
<p><div>hello</div></p>
...gets interpreted by the browser as:
<p></p><div>hello</div><p></p>
...which is the recipe for a lovely obscure bug that can throw you into fits as you try to code against the DOM.
Internet Explorer 8 and earlier do not support XHTML parsing. Even if you use an XML declaration and/or an XHTML doctype, old IE still parse the document as plain HTML. And in plain HTML, the self-closing syntax is not supported. The trailing slash is just ignored, you have to use an explicit closing tag.
Even browsers with support for XHTML parsing, such as IE 9 and later, will still parse the document as HTML unless you serve the document with a XML content type. But in that case old IE will not display the document at all!
The people above have already pretty much explained the issue, but one thing that might make things clear is that, though people use <br/> and such all the time in HTML documents, any / in such a position is basically ignored, and only used when trying to make something both parseable as XML and HTML. Try <p/>foo</p>, for example, and you get a regular paragraph.
The self closing script tag won't work, because the script tag can contain inline code, and HTML is not smart enough to turn on or off that feature based on the presence of an attribute.
On the other hand, HTML does have an excellent tag for including
references to outside resources: the <link> tag, and it can be
self-closing. It's already used to include stylesheets, RSS and Atom
feeds, canonical URIs, and all sorts of other goodies. Why not
JavaScript?
If you want the script tag to be self enclosed you can't do that as I said, but there is an alternative, though not a smart one. You can use the self closing link tag and link to your JavaScript by giving it a type of text/javascript and rel as script, something like below:
<link type="text/javascript" rel ="script" href="/path/tp/javascript" />
Unlike XML and XHTML, HTML has no knowledge of the self-closing syntax. Browsers that interpret XHTML as HTML don't know that the / character indicates that the tag should be self-closing; instead they interpret it like an empty attribute and the parser still thinks the tag is 'open'.
Just as <script defer> is treated as <script defer="defer">, <script /> is treated as <script /="/">.
Internet Explorer 8 and older don't support the proper MIME type for XHTML, application/xhtml+xml. If you're serving XHTML as text/html, which you have to for these older versions of Internet Explorer to do anything, it will be interpreted as HTML 4.01. You can only use the short syntax with any element that permits the closing tag to be omitted. See the HTML 4.01 Specification.
The XML 'short form' is interpreted as an attribute named /, which (because there is no equals sign) is interpreted as having an implicit value of "/". This is strictly wrong in HTML 4.01 - undeclared attributes are not permitted - but browsers will ignore it.
IE9 and later support XHTML 5 served with application/xhtml+xml.
That's because SCRIPT TAG is not a VOID ELEMENT.
In an HTML Document - VOID ELEMENTS do not need a "closing tag" at all!
In xhtml, everything is Generic, therefore they all need termination e.g. a "closing tag"; Including br, a simple line-break, as <br></br> or its shorthand <br />.
However, a Script Element is never a void or a parametric Element, because script tag before anything else, is a Browser Instruction, not a Data Description declaration.
Principally, a Semantic Termination Instruction e.g., a "closing tag" is only needed for processing instructions who's semantics cannot be terminated by a succeeding tag. For instance:
<H1> semantics cannot be terminated by a following <P> because it doesn't carry enough of its own semantics to override and therefore terminate the previous H1 instruction set. Although it will be able to break the stream into a new paragraph line, it is not "strong enough" to override the present font size & style line-height pouring down the stream, i.e leaking from H1 (because P doesn't have it).
This is how and why the "/" (termination) signalling has been invented.
A generic no-description termination Tag like < />, would have sufficed for any single fall off the encountered cascade, e.g.: <H1>Title< /> but that's not always the case, because we also want to be capable of "nesting", multiple intermediary tagging of the Stream: split into torrents before wrapping / falling onto another cascade. As a consequence a generic terminator such as < /> would not be able to determine the target of a property to terminate. For example: <b>bold <i>bold-italic < /> italic </>normal. Would undoubtedly fail to get our intention right and would most probably interpret it as bold bold-itallic bold normal.
This is how the notion of a wrapper ie., container was born. (These notions are so similar that it is impossible to discern and sometimes the same element may have both. <H1> is both wrapper and container at the same time. Whereas <B> only a semantic wrapper). We'll need a plain, no semantics container. And of course the invention of a DIV Element came by.
The DIV element is actually a 2BR-Container. Of course the coming of CSS made the whole situation weirder than it would otherwise have been and caused a great confusion with many great consequences - indirectly!
Because with CSS you could easily override the native pre&after BR behavior of a newly invented DIV, it is often referred to, as a "do nothing container". Which is, naturally wrong! DIVs are block elements and will natively break the line of the stream both before and after the end signalling. Soon the WEB started suffering from page DIV-itis. Most of them still are.
The coming of CSS with its capability to fully override and completely redefine the native behavior of any HTML Tag, somehow managed to confuse and blur the whole meaning of HTML existence...
Suddenly all HTML tags appeared as if obsolete, they were defaced, stripped of all their original meaning, identity and purpose. Somehow you'd gain the impression that they're no longer needed. Saying: A single container-wrapper tag would suffice for all the data presentation. Just add the required attributes. Why not have meaningful tags instead; Invent tag names as you go and let the CSS bother with the rest.
This is how xhtml was born and of course the great blunt, paid so dearly by new comers and a distorted vision of what is what, and what's the damn purpose of it all. W3C went from World Wide Web to What Went Wrong, Comrades?!!
The purpose of HTML is to stream meaningful data to the human recipient.
To deliver Information.
The formal part is there to only assist the clarity of information delivery.
xhtml doesn't give the slightest consideration to the information. - To it, the information is absolutely irrelevant.
The most important thing in the matter is to know and be able to understand that xhtml is not just a version of some extended HTML, xhtml is a completely different beast; grounds up; and therefore it is wise to keep them separate.
Simply modern answer is because the tag is denoted as mandatory that way
Tag omission None, both the starting and ending tag are mandatory.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/script
Difference between 'true XHTML', 'faux XHTML' and 'ordinary HTML' as well as importance of the server-sent MIME type had been already described here well.
If you want to try it out right now, here is simple editable snippet with live preview including self-closed script tag (see <script src="data:text/javascript,/*functionality*/" />) and XML entity (unrelated, see &x;).
As you can see, depending on the MIME type of embedding document the data-URI JavaScript functionality is either executed and consecutive text displayed (in application/xhtml+xml mode) or not executed and consecutive text 'devoured' by the script (in text/html mode).
div { display: flex; }
div + div {flex-direction: column; }
<div>Mime type: <label><input type="radio" onchange="t.onkeyup()" id="x" checked name="mime"> application/xhtml+xml</label>
<label><input type="radio" onchange="t.onkeyup()" name="mime"> text/html</label></div>
<div><textarea id="t" rows="4"
onkeyup="i.src='data:'+(x.checked?'application/xhtml+xml':'text/html')+','+encodeURIComponent(t.value)"
><?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[<!ENTITY x "true XHTML">]>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p>
<span id="greet" swapto="Hello">Hell, NO :(</span> &x;.
<script src="data:text/javascript,(g=document.getElementById('greet')).innerText=g.getAttribute('swapto')" />
Nice to meet you!
<!--
Previous text node and all further content falls into SCRIPT element content in text/html mode, so is not rendered. Because no end script tag is found, no script runs in text/html
-->
</p>
</body>
</html></textarea>
<iframe id="i" height="80"></iframe>
<script>t.onkeyup()</script>
</div>
You should see Hello, true XHTML. Nice to meet you! below textarea.
For incapable browsers you can copy content of the textarea and save it as a file with .xhtml (or .xht) extension (thanks Alek for this hint).

Resources