W3C Validation and code in website text - validation

If I were to take the error logs of a file, and put them on a properly formatted HTML page, the page might not pass validation by W3C's validation tool because the error text contains markup that the validation tool gets confused with. How can I properly mark such text such that the validation tool realizes that it's reading what should be interpreted as a normal string and not markup?

You need to convert the following characters into their HTML entities (stolen from the PHP manual on htmlspecialchars()):
'&' (ampersand) becomes '&'
'"' (double quote) becomes '"'
''' (single quote) becomes '''
'<' (less than) becomes '<'
'>' (greater than) becomes '>'

If it really is a normal string, then you should represent the characters that have special meaning in HTML with their respective entities. < to < etc.
This will, of course, cause browsers to render those characters as text and not tags.
If you want browsers to render them as tags, then they aren't a normal string and they are a validity error.

Related

Emphasize multiple parts within a single string [duplicate]

How can I make a part of a word bold in reStructuredText?
Here is an example of what I need: ".rst stands for restructured text."
I was surprised that you could not simply write
.rst stands for **r**e**s**tructured **t**ext.
but the reStructuredText specification indeed states that inline markup must be followed by white-space or one of - . , : ; ! ? \ / ' " ) ] } or >, so the above string of reStructuredText is not valid. However, only a minor change is required to get valid character markup with backslash escapes. Changing the above to
.rst stands for **r**\ e\ **s**\ tructured **t**\ ext.
works fine. To see this in action try the online reST to HTML converter.

How Do Validators Differentiate Between '&' and '&amp'?

Knowing that & is the html entity value of & - how do validators like w3c know this? Even when I look at my source code it's already been parsed into the correct value.
Your question is based on a false premise -- as Co_42 noted, & is not the "ASCII value" of '&'. It's a HTML character reference representing the character '&'. The ASCII value of '&' is 38 (or 0x26).
Your source code almost certainly consists of ASCII or Unicode text files. Those don't use HTML entities. If you have a string with an ampersand stored in the source code, it'll probably be stored with a bare "&". If there's a string literal somewhere containing actual HTML data, it may contain "&".
When you use some sort of tool or function to convert strings to text ready to put into for an HTML or XML document, any "&" will be (should be!) converted into "&".
When a program that reads HTML documents encounters an ASCII "&", it can assume that that's the beginning of a HTML character reference. This is okay because all ampersands in the actual text should have been converted into "&".
As a somewhat perverse example, if you open your source code in a word processor and save it as an HTML document, you'll find that in the actual file, "&" has been converted into "&" (and "&" has been converted into "&amp;"). If you then open that document in a browser, you'll find that the ampersands are displayed the same way they are when you view your source code in a text editor. The encoding step that happened when you saved the HTML document corresponds to the decoding step that happens when the browser displays it.
If you put something like "Fish & chips" directly into an actual HTML document, your HTML document will be invalid. Complicating the matter is the fact that programs such as browsers tend to try to recover from errors in document and display the documents anyway. As such, your browser may still display "Fish & chips" on the screen when you open your invalid document. However, a program such as the W3C validator, which is specifically meant to discover errors in HTML documents, will notify you that your document is invalid.

Error while rendering '&' in telerik html textbox

How can I handle an ampersand ("&") character in a Telerik HTML textbox?
While rendering, it's giving me an error. Also, does anybody know about any other character that may cause errors in an HTML textbox?
Ampersand is a special character in HTML that specifies the start of an escape sequence (so you can do something like © to get a copyright symbol, etc.). If you want to display an ampersand you have to escape it. So if you replace all ampersands with &, that should take care of the error.
However, if there were ampersands in your input that were already escaped - like maybe your data had © - you wouldn't want to escape that ampersand. But if your data won't have any of these ampersands, a simple replace should be fine.
You also need to replace greater than and less than symbols (> and <) with > and < respectively.
Telerik talks about these limitations/issues on this page http://www.telerik.com/help/reporting/report-items-html-text-box.html
Also according to the HTML specification (and the general XML
specification as well) the "&", "<" and ">" characters are considered
special (markup delimiters), so they need to be encoded in order to be
treated as regular text. For example the "&" character can be escaped
with the "&" entity. More information on the subject you can find in
this w3.org article.

Bug in Chrome, or Stupidity in User? Sanitising inputs on forms?

I've written a more detailed post about this on my blog at:
http://idisposable.co.uk/2010/07/chrome-are-you-sanitising-my-inputs-without-my-permission/
but basically, I have a string which is:
||abcdefg
hijklmn
opqrstu
vwxyz
||
the pipes I've added to give an indiciation of where the string starts and ends, in particular note the final carriage return on the last line.
I need to put this into a hidden form variable to post off to a supplier.
In basically, any browser except chrome, I get the following:
<input type="hidden" id="pareqMsg" value="abcdefg
hijklmn
opqrstu
vwxyz
" />
but in chrome, it seems to apply a .Trim() or something else that gives me:
<input type="hidden" id="pareqMsg" value="abcdefg
hijklmn
opqrstu
vwxyz" />
Notice it's cut off the last carriage return. These carriage returns (when Encoded) come up as %0A if that helps.
Basically, in any browser except chrome, the whole thing just works and I get the desired response from the third party. In Chrome, I get an 'invalid pareq' message (which suggests to me that those last carriage returns are important to the supplier).
Chrome version is 5.0.375.99
Am I going mad, or is this a bug?
Cheers,
Terry
You can't rely on form submission to preserve the exact character data you include in the value of a hidden field. I've had issues in the past with Firefox converting CRLF (\r\n) sequences into bare LFs, and your experience shows that Chrome's behaviour is similarly confusing.
And it turns out, it's not really a bug.
Remember that what you're supplying here is an HTML attribute value - strictly, the HTML 4 DTD defines the value attribute of the <input> element as of type CDATA. The HTML spec has this to say about CDATA attribute values:
User agents should interpret attribute values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
So whitespace within the attribute value is subject to a number of user agent transformations - conforming browsers should apparently be discarding all your linefeeds, not only the trailing one - so Chrome's behaviour is indeed buggy, but in the opposite direction to the one you want.
However, note that the browser is also expected to replace character entities with characters - which suggests you ought to be able to encode your CRs and LFs as 
 and
, and even spaces as , eliminating any actual whitespace characters from your value field altogether.
However, browser compliance with these SGML parsing rules is, as you've found, patchy, so your mileage may certainly vary.
Confirmed it here. It trims trailing CRLFs, they don't get parsed into the browser's DOM (I assume for all HTML attributes).
If you append CRLF with script, e.g.
var pareqMsg = document.forms[0]['pareqMsg']
if (/\r\n$/.test(pareqMsg.value) == false)
pareqMsg.value += '\r\n';
...they do get maintained and POSTed back to the server. Although the hidden <textarea> idea suggested by Gaby might be easier!
Normally in an input box you cannot enter (by keyboard) a newline.. so perhaps chrome enforces this even for embedded, through the attributes, values ..
try using a textarea (with display:none)..

Problem With Regular Expression to Remove HTML Tags

In my Ruby app, I've used the following method and regular expression to remove all HTML tags from a string:
str.gsub(/<\/?[^>]*>/,"")
This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into “
and all single quotes to be changed to ”
.
What's the obvious thing I'm missing to convert the messy codes back into their proper characters?
Edit: The problem occurs with or without the Regular Expression, so it's clear my problem has nothing to do with it. My question now is how to deal with this formatting error and correct it. Thanks!
Use CGI::unescapeHTML after you perform your regular expression substitution:
CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))
See http://www.ruby-doc.org/core/classes/CGI.html#M000547
In the above code snippet, gsub removes all HTML tags. Then, unescapeHTML() reverts all HTML entities (such as <, &#8220) to their actual characters (<, quotes, etc.)
With respect to another post on this page, note that you will never ever be passed HTML such as
<tag attribute="<value>">2 + 3 < 6</tag>
(which is invalid HTML); what you may receive is, instead:
<tag attribute="<value>">2 + 3 < 6</tag>
The call to gsub will transform the above to:
2 + 3 < 6
And unescapeHTML will finish the job:
2 + 3 < 6
You're going to run into more trouble when you see something like:
<doohickey name="<foobar>">
You'll want to apply something like:
gsub(/<[^<>]*>/, "")
...for as long as the pattern matches.
This regular expression did just about
all I was expecting it to, except it
caused all quotation marks to be
transformed into “ and all
single quotes to be changed to ”
.
This doesn't sound as if the RegExp would be doing this. Are you sure it's different before?
See this question here for information about the problem, it has got an excellent answer:
Get non UTF-8 form fields as UTF-8 in php.
I've run into a similar problem with character changes, this happened when my code ran through another module that enforced UTF-8 encoding and then when it came back, I had a different file (slurped array of lines) on my hands.
You could use a multi-pass system to get the results you are looking for.
After running your regular expression, run an expression to convert &8220; to quotes and another to convert &8221; to single quotes.

Resources