How does the empty start tag work in HTML4?

The HTML4 specification mentions various SGML shorthand markup constructs. While I understand what the others do, with the help of an HTML validator, I cannot understand why anyone would want an empty start tag. It cannot even have attributes, so it's not a shorter <span>.

The SGML declaration of HTML4 enables the empty start tag feature. It contains an interesting FEATURES section:
FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO    NONE
The important part here is the MINIMIZE section. It enables OMITTAG, a standard HTML feature which allows certain start and end tags to be omitted. In particular, this allows you to write code like <p> a <p> b without closing the paragraphs.
The more important part is the SHORTTAG feature, which is actually a category. However, because it's not expanded, SGML automatically assumes YES for every entry in it. It contains the following subcategories; a combined example follows the list, so feel free to skip both if you aren't interested in the other shorthand features of SGML.
ATTRIB, which deals with attributes and has the following options.
DEFAULT - defines whether attributes may take default values. This allows writing <p> without specifying every single attribute. Nobody would want to write <p id="" class="" style="" title="" lang="en" dir="ltr" onclick="" ondblclick="" ...></p>, after all; I even gave up trying to write it all out. This is a commonly supported feature.
OMITNAME - when an attribute's value is the same as its name, the name can be omitted. This allows writing <input type="checkbox" checked>, for instance. This is a commonly supported feature (although HTML5 defines the value of such an attribute to be the empty string, not the attribute name).
VALUE - allows writing attribute values without quotes, for example <p class=warning></p>. This is a commonly supported feature.
ENDTAG, a category for end tags, containing the following options.
UNCLOSED - allows a new tag to begin before the previous end tag is closed with >, allowing code like <p><b></b</p>.
EMPTY - allows unnamed end tags, such as <b>something</>. These close the most recently opened element that is still open.
STARTTAG, a category for start tags, containing the following options.
NETENABL - allows the Null End Tag (NET) notation. It's worth noting that this notation is incompatible with XHTML. It allows writing code like <b/<i/hello//, which means the same thing as <b><i>hello</i></b>.
UNCLOSED - allows a new tag to begin before the previous start tag is closed with >, allowing code like <p<b></b></p>.
EMPTY - this is the feature being asked about.
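To make the rest of the list concrete, here is a combined sketch (per the notes above, only the ATTRIB shorthands work in browsers; the others need a real SGML parser such as the validator's):
<p class=warning>text</p>      <!-- VALUE: unquoted attribute value -->
<input type=checkbox checked>  <!-- OMITNAME: short for checked="checked" -->
<b/bold/                       <!-- NETENABL: same as <b>bold</b> -->
<i>italic</>                   <!-- ENDTAG EMPTY: same as <i>italic</i> -->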
Now, it's important to understand what EMPTY does. While <> may appear useless at first (after all, how could you even tell what it does, when nothing aside from the validator supports it), it actually isn't. It opens a new element of the same type as the previous sibling, allowing code like the following.
<ul>
<li class=plus> hello world
<> another list element
<> yet another
<li class=minus> nope
<> what am I doing?
</ul>
In this example, the list items use two classes, plus and minus, for positive and negative arguments. However, the webmaster was lazy (and doesn't care that browsers don't support this), and decided to use empty start tags rather than spelling out <li> for the following elements. Because <li> has an optional end tag, each empty start tag also automatically closes the previous <li>.
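Spelled out in full, with the inferred tags filled in (note that the <> items get default attribute values, so they carry no class), the example means:
<ul>
<li class="plus"> hello world </li>
<li> another list element </li>
<li> yet another </li>
<li class="minus"> nope </li>
<li> what am I doing? </li>
</ul>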

Related

Trying to find two different text nodes from a descendant

Someone intentionally made a site as unfriendly as possible, so I'm doing what I can to have our scraper still get where it needs to go.
<div class="issueDetails">
<div class="issueTitle ng-binding" style="">FANCY UNIQUE TEXT dd.MM.yyyy</div>
<a>COMPLETELY DIFFERENT TEXT</a>
I've left out the unnecessary details here, but I'm trying to find a match within the site through XPath (I can't use anything else for this) that fulfils both conditions: FANCY UNIQUE TEXT dd.MM.yyyy as well as COMPLETELY DIFFERENT TEXT.
I've tried my luck with //div[@class='issueDetails']/descendant::*[contains(text(), 'FANCY UNIQUE TEXT dd.MM.yyyy') and contains(text(), 'COMPLETELY DIFFERENT TEXT')]
but it contains the erroneous logic that both unique texts are in the same element.
The first, FANCY UNIQUE TEXT, is the unique identifier for where I want to go. The second, COMPLETELY DIFFERENT TEXT, is what I need the scraper to click on to actually get there. So I need an XPath that finds both despite them being in different descendants.
Is this what you're looking for:
//div[@class="issueDetails"]/*[contains(.,"COMPLETELY DIFFERENT TEXT") or translate(substring(.,string-length(.)-9,10),"123456789","000000000")="00.00.0000" and contains(.,'FANCY UNIQUE TEXT')]
It will return the two elements respecting your conditions: the div and the a.
The translate(), string-length() and substring() functions are used to check whether a date pattern is present in the div element.
EDIT: Check whether the parent contains both texts you're looking for, then get the matching children with:
//div[@class='issueDetails'][contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") and contains(.,"COMPLETELY DIFFERENT TEXT")]/*[contains(.,"FANCY UNIQUE TEXT dd.MM.yyyy") or contains(.,"COMPLETELY DIFFERENT TEXT")]
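For what it's worth, here is a minimal Python sketch to sanity-check that expression offline (lxml is an assumption here, and the snippet closes the markup the question elided; the question itself is XPath-only):
from lxml import html

snippet = """
<div class="issueDetails">
  <div class="issueTitle ng-binding">FANCY UNIQUE TEXT dd.MM.yyyy</div>
  <a>COMPLETELY DIFFERENT TEXT</a>
</div>
"""

tree = html.fromstring(snippet)
# Parent must contain both texts; then take the children holding either one.
matches = tree.xpath(
    "//div[@class='issueDetails']"
    "[contains(.,'FANCY UNIQUE TEXT dd.MM.yyyy') and contains(.,'COMPLETELY DIFFERENT TEXT')]"
    "/*[contains(.,'FANCY UNIQUE TEXT dd.MM.yyyy') or contains(.,'COMPLETELY DIFFERENT TEXT')]"
)
print([el.tag for el in matches])  # expected: ['div', 'a']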

Forcing string interpolation in Jade

I am trying to use Jade to do some string interpolation + i18n.
I wrote a custom tag:
mixin unsubscribe
  a(title='unsubscribe_link', href='#{target_address}/',
    target='_blank', style='color:#00b2e2;text-decoration:none;')
    = __("Click here")
Then I got the following to work:
p
  | #[+unsubscribe] to unsubscribe
However, in order to support i18n, I would also like to wrap the whole string in a translation block, i.e. in the function called as __().
But when I wrap the string in a code block, it no longer renders the custom tag.
p
  | #{__("#[+unsubscribe] to unsubscribe")}
p
  = __("#[+unsubscribe] to unsubscribe")
Both will output #[+unsubscribe] to unsubscribe literally. Is there a way to force interpolation of the tag in the string returned from the function?
Edit 1
As has been pointed out, nesting the "Click here" doesn't really make sense, since it would create separate translation strings.
My goal with all this is really to create a simplified text string that can be passed off to a translation service:
So ideally it should be:
"#[+unsubscribe('Click here')] to unsubscribe"
and I would get back
"Klicken Sie #[+unsubscribe hier] um Ihr auszutragen"
My reasoning for this is that because something like gettext matches by exact strings, I would like to abstract all the logic behind the tag.
What you really want to achieve is this:
<p>
<a href='the link' title='it should also be translated!'
target='_blank' class='classes are better'>Click here</a> to unsubscribe
</p>
And for some reason you don't want to include tags in the translation. Well, unfortunately separating 'Click here' from 'to unsubscribe' will result in incorrect translations in some languages - the translator needs the context. So it is better to use the tag.
And by the way: things like __('Click here') don't allow for different translations of the same string based on context. I have no idea what translation tool you're using, but it should definitely use identifiers rather than English texts.
Going back to your original question, I believe you can use a parametrized mixin to do it:
mixin unsubscribe(title, target_address, click_here, to_unsubscribe)
  a(title=title, href=target_address, target='_blank', style='color:#00b2e2;text-decoration:none;')= click_here
  span= to_unsubscribe
This of course results in an additional <span> tag, it still does not solve the real issue (separating "Click here" from "to unsubscribe"), and there is no way to re-order the sentence, but... I guess the only valid option would be to have interpolation built into the translation engine and to write out the unescaped tag. Otherwise you'd need to redesign the page to avoid a link inside the sentence.
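For completeness, a call site for that mixin could look like this (a sketch; the i18n keys are made up):
p
  +unsubscribe(__('unsubscribe_title'), target_address, __('click_here'), __('to_unsubscribe'))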

XPath expression for attributes that don't have ancestors with the same attribute

I'm trying to extract elements that have a given attribute, without separately extracting their descendants that have the same attribute.
Using the following HTML:
<html><body>
  <div box>
    some text
    <div box>
      some more text
    </div>
  </div>
  <div box>
    this needs to be included as well
  </div>
</body></html>
I want to extract the two outer <div box> elements and their descendants (including the inner <div box>), but I don't want the inner <div box> extracted separately.
I have tried all sorts of different expressions but think I am missing something quite fundamental. The main expression I have been trying is //[@box and not(ancestor::@box)], but this still returns all three elements.
I am trying to do this using the Hpricot (0.8.3) gem in Ruby 1.9.2, as follows:
# Assuming html is set to the HTML above
doc = Hpricot(html)
elements = doc.search('//[@box and not(ancestor::@box)]')
# The following returns 3 instead of 2
elements.size
Any help on this would be great.
Your XPath is invalid. You have to address something in order to use the predicate filter (e.g. the [] part); otherwise, there isn't anything to filter.
This XPath works:
//div[@box and not(ancestor::div/@box)]
If the elements aren't all guaranteed to be <div>, you can use a more generic match for elements:
//*[@box and not(ancestor::*/@box)]
Using elements = doc.search('//[@box and not(ancestor::@box)]') isn't correct.
Use elements = doc.at('//div[@box]'), which will find the first occurrence.
I'd recommend using Nokogiri over Hpricot. Nokogiri is well supported, very flexible and more robust.
EDIT: Added because the original question changed:
Thanks, that worked perfectly, except I forgot to mention that I want to return multiple outer elements. Sorry about that; I have updated the question. I will look into Nokogiri further; I didn't choose it originally because Hpricot seemed more approachable.
Remember that XPath at its simplest acts like accessing a file in a directory, so you can drill down and search in "subdirectories". If you only want the outer <div> tags, then look at the <body> level and no further:
doc.search('/html/body/div')
or, if you might have unadorned div tags along with the targets:
doc.search('/html/body/div[@box]')
Regarding Hpricot seeming more approachable:
Nokogiri implements a superset of Hpricot's accessors, allowing you to drop it into place for most uses. It supports XPath and CSS accessors, allowing more intuitive ways of getting at data if you live in CSS and HTML and don't grok XPath. In addition, there are many methods to find your desired target:
doc.search('body > div[box]')
(doc / 'body > div[box]')
doc.css('body > div[box]')
Nokogiri also supports Hpricot's at and % synonyms, along with at_css and at_xpath, if you only want the first occurrence of something.
I started using Nokogiri after running into situations where Hpricot exploded because it couldn't handle malformed news feeds in the wild.
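To tie it together, a minimal Nokogiri sketch (assuming html is the string from the question):
require 'nokogiri'

doc = Nokogiri::HTML(html)
# Only the two outer <div box> elements match; the nested one is excluded.
doc.xpath('//div[@box and not(ancestor::div/@box)]').size  # => 2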

Is there such a thing as a valid HTML5 fragment?

I obviously can't determine whether a fragment of HTML is valid without knowing what the rest of the document looks like (at a minimum, I would need a doctype to know which rules I'm validating against). But given the following HTML5 fragment:
<article><header></article>My header</header><p>My text</p></article>
I can certainly determine that it is invalid without seeing the rest of the document. So, is there such a thing as "provisionally valid" HTML, or "valid providing it fits into a certain place in a valid document"?
Is there more to it than the following pseudocode?
def is_valid_fragment(fragment):
    tmp = "<!doctype html><html><head><title></title></head><body>" + fragment + "</body></html>"
    return my_HTML5_validator.is_valid_html5_document(tmp)
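For what it's worth, that pseudocode maps almost directly onto an existing parser (a sketch; html5lib and its strict mode are an assumption here, and catching parse errors is not the same thing as full conformance checking):
import html5lib
from html5lib.html5parser import ParseError

def is_valid_fragment(fragment):
    # Wrap the fragment exactly as in the pseudocode above.
    tmp = ("<!doctype html><html><head><title></title></head><body>"
           + fragment + "</body></html>")
    try:
        # strict=True makes html5lib raise on the first parse error.
        html5lib.HTMLParser(strict=True).parse(tmp)
        return True
    except ParseError:
        return False

print(is_valid_fragment("<p>My text</p>"))                        # True
print(is_valid_fragment("<article><header></article>My header"))  # False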
You can certainly talk about an XML document being well-formed, and you can construct a document from any single element and its children. You could thus talk about singly-rooted XHTML5 fragments being well-formed. You could deal with a multiply-rooted fragment (like <img/><img/>) by treating it as a sequence of documents, or by wrapping it in some synthetic container element; since we're only talking about well-formedness, that would be okay.
However, HTML5 still allows the SGML-style self-closing tags, like <hr> and so on, whose self-closingness can only be determined by appeal to the doctype. For instance, <div><hr></div> is okay, but <div><tr></div> is not. If you were dealing with DOM nodes rather than text as input, this would be a non-issue, but if you have text, you'd need a parser that knows enough about HTML to deal with those elements. Beyond that, though, some very simple rules, lifted directly from XML, would be enough to handle well-formedness.
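A quick way to observe that difference in a browser console (a sketch of the behaviour described above):
const probe = document.createElement('div')
probe.innerHTML = '<div><hr></div>'
console.log(probe.innerHTML) // "<div><hr></div>" - <hr> is a void element
probe.innerHTML = '<div><tr></div>'
console.log(probe.innerHTML) // "<div></div>" - the stray <tr> is dropped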
If you wanted to go beyond well-formedness and look at some aspects of validity, I think you can still do that at the singly-rooted fragment level with XML. As the spec says:
An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.
A DTD can name any element as the root, and the mechanics then take care of checking the relationship between that element and its children, and their children and so on, and the various other constraints that make up validity.
Again, you can transfer that idea directly to HTML. I don't know how you would deal with multiply-rooted fragments, though. And bear in mind that certain whole-document constraints (like IDs being unique) might hold inside the fragment but not in an otherwise valid document once the fragment has been inserted into it.
Depending on what you intend to do with this verification, keep in mind that browsers are extremely forgiving of malformed HTML!
The invalid HTML string from your example would work perfectly fine in (most, if not all) browsers:
const serializedHTML = "<article><header></article>My header</header><p>My text</p></article>"
const range = document.createRange()
const fragment = range.createContextualFragment(serializedHTML)
console.log(fragment)
The fragment defined in the snippet above results in the following DOM tree:
<article>
<header></header>
</article>
"My header"
<p>My text</p>
A crude method would be to check whether passing the fragment through the innerHTML of another element changes it, with something like the code below.
<html>
<head>
<script>
function validateHTML(htmlFragment) {
  var testDiv = document.getElementById('testDiv')
  testDiv.innerHTML = htmlFragment
  var res = htmlFragment == testDiv.innerHTML
  testDiv.innerHTML = ""
  return res
}
</script>
</head>
<body>
<div id=testDiv style='display:none'></div>
<textarea id=txtElem onKeyUp="this.style.backgroundColor = validateHTML(this.value) ? '' : '#f00'"></textarea>
</body>
</html>
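Note that this round-trip check is conservative: the browser's serializer normalizes markup, so some perfectly reasonable fragments get flagged. A sketch of the failure mode:
validateHTML('<p>ok</p>')        // true
validateHTML('<hr/>')            // false - serialized back as "<hr>"
validateHTML("<p class=a></p>")  // false - quotes added: <p class="a"></p>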
You could check if it is well-formed.

Selenium RC locators - referring to subsequent elements?

When more than one element on a page matches the same locator, how should the subsequent elements be referenced?
Using XPath locators it's possible to add array notation, e.g. xpath=(//span/div)[1].
But with simple locators?
For example, if there are three links identified by link=Click Here, simply appending [3] won't get the third element.
And where is the authoritative reference for addressing arrays of elements? I couldn't find any.
Selenium doesn't handle arrays of locators by itself. It just returns the first element that matches your query, so if you want a later one, you have to use XPath, DOM, or even better, CSS.
So for the link example you should use:
selenium.click("css=a:contains('Click Here'):nth-child(3)")
Santi is correct that Selenium returns the first element matching your specified locator and that you have to apply the appropriate expression for the locator type you use. I thought it would be useful to give the details here, though, for in this case they do border on being "gory details":
CSS
The :nth-child pseudo-class is tricky to use; it has subtleties that are little-known and not clearly documented, even on the W3C pages. Consider a list such as this:
<ul>
  <li class="bird">petrel</li>
  <li class="mammal">platypus</li>
  <li class="bird">albatross</li>
  <li class="bird">shearwater</li>
</ul>
Then the selector css=li.bird:nth-child(3) returns the albatross element, not the shearwater! The reason is that the index (3) is applied to the list of all elements that are siblings of the first matching element - unfiltered by the .bird class! Once it has the element at that index (here, the third one), it applies the bird class filter: if the element in hand matches, it is returned; if it does not, the match fails.
Now consider the selector css=li.bird:nth-child(2). This starts with the second element - the platypus - sees it is not a bird, and comes up empty. This manifests as your code throwing a "not found" exception!
What might better fit the typical mental model of finding an indexed entry is the CSS :nth-of-type pseudo-class, which applies the filter before indexing. Unfortunately, it is not supported by Selenium, according to the official documentation on locators.
XPath
Your question already showed that you know how to do this in XPath: add an index at any point in the expression with square brackets. You could, for example, use something like //*[@id='abc']/div[3]/p[2]/span to find a span in the second paragraph under the third div under the specified id.
DOM
DOM uses the same square-bracket notation as XPath, except that DOM indexes from zero while XPath indexes from 1: document.getElementsByTagName("div")[1] returns the second div, not the first! DOM offers an alternate syntax as well: document.getElementsByTagName("div").item(1) is exactly equivalent. And note that with getElementsByTagName you always have to use an index, since it returns a node list rather than a single node.
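Putting the three locator styles side by side for the original "third Click Here link" example (a sketch; the dom= variant assumes those are the only links on the page, and the CSS one is subject to the :nth-child caveats above):
selenium.click("xpath=(//a[text()='Click Here'])[3]")
selenium.click("css=a:contains('Click Here'):nth-child(3)")
selenium.click("dom=document.links[2]")  // zero-based, so [2] is the third link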
