Do I really need to encode '&' as '&amp;'? - validation

I'm using an '&' symbol with HTML5 and UTF-8 in my site's <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles.
http://validator.w3.org is giving me this:
& did not start a character reference. (& probably should have been escaped as &amp;.)
Do I really need to do &amp;?
I'm not fussed about my pages validating for the sake of validating, but I'm curious to hear people's opinions on this and if it's important and why.

Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they're parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as &amp; and everything would be fine.
HTML5 allows you to leave it unescaped, but only when the data that follows does not look like a valid character reference. However, it's better just to escape all instances of this symbol than worry about which ones should be and which ones don't need to be.
Keep this point in mind: if you're not escaping & to &amp;, that's bad enough for data that you create (where the code could very well be invalid), but you might also not be escaping tag delimiters. That is a huge problem for user-submitted data, which could very well lead to HTML and script injection, cookie stealing and other exploits.
Please just escape your code. It will save you a lot of trouble in the future.
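As a rough illustration of "just escape it", here is a minimal sketch in Python (a choice made only for this example; the question itself is language-agnostic). The sample strings are made up; html.escape is the standard library function that turns &, < and > (and, by default, quotes) into character references.

```python
import html

# Hypothetical sample values: one harmless title, one hostile comment.
title = "Dolce & Gabbana"
comment = '<script>alert("hi")</script>'

# html.escape replaces &, <, > and (by default) both quote characters.
escaped_title = html.escape(title)
escaped_comment = html.escape(comment)

print(f"<title>{escaped_title}</title>")
print(f"<p>{escaped_comment}</p>")
```

The same call covers both the cosmetic case (the ampersand) and the dangerous one (the tag delimiters), which is the point of escaping everything consistently.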

Validation aside, the fact remains that encoding certain characters is important to an HTML document so that it can render properly and safely as a web page.
Encoding & as &amp; under all circumstances, for me, is an easier rule to live by, reducing the likelihood of errors and failures.
Compare the following: which is easier? Which is easier to bugger up?
Methodology 1
Write some content which includes ampersand characters.
Encode them all.
Methodology 2
(with a grain of salt, please ;) )
Write some content which includes ampersand characters.
On a case-by-case basis, look at each ampersand. Determine if:
It is isolated, and as such unambiguously an ampersand. eg. volt & amp > In that case don't bother encoding it.
It is not isolated, but you feel it is nonetheless unambiguous, as the resulting entity does not exist and will never exist since the entity list could never evolve. E.g., amp&volt >. In that case, don't bother encoding it.
It is not isolated, and ambiguous. E.g., volt&amp > Encode it.
??

HTML5 rules are different from HTML4. It's not required in HTML5 - unless the ampersand looks like it starts a parameter name. "&copy=2" is still a problem, for example, since &copy; is the copyright symbol.
However it seems to me that it's harder work to decide to encode or not to encode depending on the following text. So the easiest path is probably to encode all the time.
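To see how an unescaped ampersand can be misread, here is a small sketch using Python's html.unescape, which is documented to follow the HTML5 rules for character references in text. The sample strings are invented for illustration. Browsers apply slightly different rules inside attribute values, so treat this as a demonstration of the ambiguity rather than of any particular browser.

```python
import html

# Made-up inputs: some ampersands are harmless, some are not.
samples = ["volt & amp", "amp&volt", "volt&amp", "page?x=1&copy=2"]

# Decode each string the way an HTML5 text parser would.
decoded = {s: html.unescape(s) for s in samples}

for original, result in decoded.items():
    print(f"{original!r} -> {result!r}")
```

The first two strings come back unchanged, while the last two are silently altered, which is why encoding every ampersand is the simpler rule.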

I think this has turned into more of a question of "why follow the spec when browsers don't care." Here is my generalized answer:
Standards are not a "present" thing. They are a "future" thing. If we, as developers, follow web standards, then browser vendors are more likely to correctly implement those standards, and we move closer to a completely interoperable web, where CSS hacks, feature detection, and browser detection are not necessary. Where we don't have to figure out why our layouts break in a particular browser, or how to work around that.
Specifically, if HTML5 does not require using &amp; in your specific situation, and you're using an HTML5 doctype (and also expecting your users to be using HTML5-compliant browsers), then there is no reason to do it.

Well, if it comes from user input then absolutely yes, for obvious reasons. Think if this very website didn't do it: the title of this question would show up as Do I really need to encode ‘&’ as ‘&’?
If it's just something like echo '<title>Dolce & Gabbana</title>'; then strictly speaking you don't have to. It would be better, but if you don't, no user will notice the difference.

Could you show us what your title actually is? When I submit
<!DOCTYPE html>
<html>
<title>Dolce & Gabbana</title>
<body>
<p>Am I allowed loose & mpersands?</p>
</body>
</html>
to http://validator.w3.org/ - explicitly asking it to use the experimental HTML 5 mode - it has no complaints about the &s...

In HTML, a & marks the beginning of a reference, either of a character reference or of an entity reference. From that point on, the parser expects either a # denoting a character reference, or an entity name denoting an entity reference, both followed by a ;. That’s the normal behavior.
But if the reference name or just the reference opening & is followed by a white space or other delimiters like ", ', <, >, &, the ending ; and even a reference to represent a plain & can be omitted:
<p title="&amp;">foo &amp; bar</p>
<p title="&amp">foo &amp bar</p>
<p title="&">foo & bar</p>
Only in these cases can the ending ; or even the reference itself be omitted (at least in HTML 4). I think HTML 5 requires the ending ;.
But the specification recommends to always use a reference like the character reference &#38; or the entity reference &amp; to avoid confusion:
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Update (March 2020): The W3C validator no longer complains about escaping URLs.
I was checking why image URLs need escaping and hence tried it in https://validator.w3.org. The explanation is pretty nice. It highlights that even URLs need to be escaped. [PS: I guess it will be unescaped when it's consumed since URLs need &. Can anyone clarify?]
<img alt="" src="foo?bar=qut&qux=fop" />
An entity reference was found in the document, but there is no
reference by that name defined. Often this is caused by misspelling
the reference name, unencoded ampersands, or by leaving off the
trailing semicolon (;). The most common cause of this error is
unencoded ampersands in URLs as described by the WDG in "Ampersands in
URLs". Entity references start with an ampersand (&) and end with a
semicolon (;). If you want to use a literal ampersand in your document
you must encode it as "&amp;" (even inside URLs!). Be careful to end
entity references with a semicolon or your entity reference may get
interpreted in connection with the following text. Also keep in mind
that named entity references are case-sensitive; &Aelig; and &aelig;
are different characters. If this error appears in some markup
generated by PHP's session handling code, this article has
explanations and solutions to your problem.
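On the "will it be unescaped when it's consumed?" question above: yes, an HTML parser decodes character references in attribute values before the URL is used. A minimal sketch with Python's standard html.parser follows; the SrcCollector class is a hypothetical helper written for this example, not part of any library.

```python
import html
from html.parser import HTMLParser

url = "foo?bar=qut&qux=fop"

# Escape the URL when writing it into the attribute.
img_markup = f'<img alt="" src="{html.escape(url, quote=True)}" />'

class SrcCollector(HTMLParser):
    """Hypothetical helper: records every src attribute it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over attribute values with references decoded.
        for name, value in attrs:
            if name == "src":
                self.sources.append(value)

collector = SrcCollector()
collector.feed(img_markup)
collector.close()

print(img_markup)
print(collector.sources)
```

The markup carries &amp;, but the value the parser reports is the original URL with a plain &.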

It depends on the likelihood of a semicolon ending up near your &, causing it to display something quite different.
For example, when dealing with input from users (say, if you include the user-provided subject of a forum post in your title tags), you never know where they might be putting random semicolons, and it might randomly display strange entities. So always escape in that situation.
For your own static HTML content, sure, you could skip it, but it's so trivial to include proper escaping, that there's no good reason to avoid it.

If the user passes it to you, or it will wind up in a URL, you need to escape it.
If it appears in static text on a page? All browsers will get this one right either way, and you don't worry much about it, since it will work.

Yes, you should try to serve valid code if possible.
Most browsers will silently correct this error, but there is a problem with relying on the error handling in the browsers. There is no standard for how to handle incorrect code, so it's up to each browser vendor to try to figure out what to do with each error, and the results may vary.
Some examples where browsers are likely to react differently are if you put elements inside a table but outside the table cells, or if you nest links inside each other.
For your specific example it's not likely to cause any problems, but error correction in the browser might for example cause the browser to change from standards compliant mode into quirks mode, which could make your layout break down completely.
So, you should correct errors like this in the code, if for no other reason than to keep the error list in the validator short, so that you can spot more serious problems.

A couple of years ago, we got a report that one of our web apps wasn't displaying correctly in Firefox. It turned out that the page contained a tag that looked like
<div style="..." ... style="...">
When faced with a repeated style attribute, Internet Explorer combines both of the styles, while Firefox only uses one of them, hence the different behavior. I changed the tag to
<div style="...; ..." ...>
and sure enough, it fixed the problem! The moral of the story is that browsers have more consistent handling of valid HTML than of invalid HTML. So, fix your damn markup already! (Or use HTML Tidy to fix it.)

If & is used in HTML then you should escape it.
If & is used in JavaScript strings, e.g., an alert('This & that'); or document.href, you don't need to escape it.
If you're using document.write then you should escape it, e.g. document.write('<p>this &amp; that</p>').

If you're really talking about the static text
<title>Foo & Bar</title>
stored in some file on the hard disk and served directly by a server, then yes: it probably doesn't need to be escaped.
However, since there is very little HTML content nowadays that's completely static, I'll add the following disclaimer that assumes that the HTML content is generated from some other source (database content, user input, web service call result, legacy API result, ...):
If you don't escape a simple &, then chances are you also don't escape a &amp; or a &nbsp; or <b> or <script src="http://attacker.com/evil.js"> or any other invalid text. That would mean that you are at best displaying your content wrongly and more likely are susceptible to XSS attacks.
In other words: when you're already checking and escaping the other more problematic cases, then there's almost no reason to leave the not-totally-broken-but-still-somewhat-fishy standalone-& unescaped.
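A sketch of what "already checking and escaping" might look like when the page is generated, written in Python for illustration. render_comment and its parameters are hypothetical names; the only real API used is html.escape. Every interpolated value goes through the same escaping call, so the lone & is handled by the same code path as the hostile script tag.

```python
import html

def render_comment(author: str, body: str) -> str:
    """Hypothetical template helper: escape every value that is interpolated."""
    return f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"

# Untrusted input, as it might arrive from a database or a form.
hostile = '<script src="http://attacker.com/evil.js"></script>'
rendered = render_comment("Foo & Bar", hostile)

print(rendered)
```

In practice a templating library with auto-escaping usually does this for you; the sketch only shows the principle.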

The link has a fairly good example of when and why you may need to escape & to &amp;
https://jsfiddle.net/vh2h7usk/1/
Interestingly, I had to escape the character in order to represent it properly in my answer here. If I were to use the built-in code sample option (from the answer panel), I can just type in & and it appears as it should. But if I were to manually use the <code></code> element, then I have to escape in order to represent it correctly :)

Related

Security benefits of encoding HTML special characters in JSON responses

I have recently received a recommendation, from a third party, to encode HTML special characters in all server responses "for security reasons". So:
' --> &#x27;
& --> &amp;
e.g.
{ "id": 1, "name": "Miles O&#x27;Brien" }
Question: Is there a security gain in doing this, or is it just a paranoia?
& --> &amp;
Are you sure this was the kind of encoding they meant?
There is a reason to encode HTML-special characters being returned inside JSON responses, and that's to avoid XSS caused by unwanted type-sniffing. For example if you had:
{ "name": "<body>Mister <script>...</script>" }
and an attacker included a link to your JSON-returning resource in an HTML context (eg iframe src) then a stupid browser might decide that, due to the giveaway string <body>, your document was not a JSON object but an HTML document. It could then execute the script in your security context, leading to XSS vulns.
The solution to this is to use JSON string literal escaping, for example:
{ "name": "\u003Cbody\u003EMister \u003Cscript\u003E...\u003C/script\u003E" }
Using HTML-escaping in this context, whilst it avoids the problem, has the side-effect of changing the meaning of the strings. "Miles O&#x27;Brien" read by a JSON parser is still Miles O&#x27;Brien with the ampersand-x-twenty-seven in it, so if you're writing that value to the page using the likes of .value, .textContent or jQuery .text() it's going to look weird.
Now if you were assigning that string to .innerHTML or jQuery .html() instead, then yeah, you'd definitely need to HTML-escape it at some point, regardless of the JSON XSS problem. However I'd suggest that in this case, for separation-of-concerns reasons, that point should be at the client end where you're actually injecting the content into HTML markup, rather than the server side generating the JSON. In general it is better to avoid injecting strings into markup anyhow, when safer DOM-style methods are available.
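The JSON string literal escaping described above can be sketched like this in Python. dumps_html_safe is a hypothetical helper name, and the payload is a made-up example. It relies on the fact that <, > and & can only occur inside string literals in JSON output, so replacing them with \u escapes does not change what a JSON parser reads back.

```python
import json

def dumps_html_safe(value) -> str:
    """Hypothetical helper: serialize to JSON, then hide HTML-significant characters."""
    text = json.dumps(value)
    # These characters can only appear inside JSON strings, so a plain
    # textual replacement with the equivalent \u escapes is safe.
    return (
        text.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )

payload = {"id": 1, "name": "<body>Mister <script>...</script>"}
encoded = dumps_html_safe(payload)

print(encoded)
print(json.loads(encoded) == payload)
```

Unlike HTML-escaping, the decoded value is identical to the original, so clients using .textContent or .text() see the name unchanged.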
Depending on what you are using the data for, yes there is a security benefit.
If you were taking user input, and sending it back to your server, then using it to interact with your database; I could potentially terminate one of your strings, and inject my own SQL statements. And even without a malicious mindset, you sending around quotation characters could accidentally terminate strings.
There seems to be a tendency to protect novice/stupid/naive developers from creating XSS holes in their sites. Especially when someone else is going to deal with these responses (e.g. open API, some junior developer on your team) he might forget to properly HTML-encode the strings before feeding them to some $('#myelement').html() method. The idea is that escaping these responses on the server will result in double escaping (worst case) for developers who don't understand escaping, whereas "smart" developers will know when to unescape the values before using them. The alternative being that "less smart" developers will create a site filled with XSS security holes.
Personally I'm not a big fan of this tendency, but I certainly see how it will result in a safer internet overall, especially as web development is more and more being practiced as a hobby.... What you choose to do is up to you, but this is the rationale behind the request to html-escape all strings in JSON.
Examples of others doing this:
Spotify
Open Social (I don't have a link, I'm sure you'll take my word for it)

NSXMLParser with HTML Containing Javascript and "bad" characters

I am using NSXMLParser to parse HTML from web sites. Testing site is under my control but in operation sites will not be.
Problem is when parser encounters javascript which contains "bad" characters. For example, javascript containing if(screen.width<=521). The problem is the < in the code. I can see the problem but am unsure if there is any good way round it. (the NSXMLParser is reporting NSXMLParserErrorDomain error 68. and I can see why - it is treating the <= as the start of a new tag but = is not a valid tag name character...). But then what would I do with e.g. if(var<20) ?
I am actually not interested in the specific content, so I could do things like global replace/removal of e.g. "<=" and ">=" (etc.), but in some regards that seems a bit of a mess, as I was using NSXMLParser to avoid having to start messing around with the content. If substitution is the best way forward, I can envisage "<=" and ">=", but are there any other sequences I should include?
I am new to Cocoa so may easily have missed something obvious - in which case many apologies. I did see that others have found similar problems but could not get a good way forward from the questions.
I am handling the error OK (in a tidy manner) but it is preventing my app from doing what it is meant to do - i.e. I need to avoid the error rather than handle it.
Background: that application is doing a "before" and "after" comparison on the html and looking for changes. I could swap "<=" for something really weird, then swap it back when necessary. I could even check the data for the replace content first to eliminate possible ambiguities (e.g. find a UID sequence not in the downloaded page, replace "<=" with UID sequence, parse page, if need be, replace UID with "<=", ditto for ">=".
(I have looked at e.g. libtidy or libxml2 but cannot find easy documentation and am wary about launching down such a route if it will not solve the issues.)
NSXMLParser, as its name implies, is not meant for parsing HTML. XML is much stricter than HTML, and the errors you've encountered are certainly not the only ones that are possible with real-world HTML. There are HTML documents that are also valid XML, but that is the exception, rather than the norm.
I would suggest using a proper HTML parser instead, such as this one, which is an Objective-C wrapper around libxml's HTML parsing functions.
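The difference between the two kinds of parser is easy to reproduce outside Cocoa. The sketch below uses Python's standard library purely as an analogy (it says nothing about NSXMLParser's or libxml's exact behavior): a strict XML parser rejects the script content, while an HTML parser treats it as raw text. The page string and the TextCollector class are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

page = "<html><body><script>if(screen.width<=521){}</script><p>ok</p></body></html>"

# A strict XML parser sees "<=" as the start of a malformed tag.
try:
    ET.fromstring(page)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

class TextCollector(HTMLParser):
    """Hypothetical helper: gathers all text, including script content."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
collector.feed(page)
collector.close()
text = "".join(collector.chunks)

print(xml_ok)
print(text)
```

No substitution of "<=" is needed once the parser understands that script content is not markup.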

HTML/XSS escape on input vs output

From everything I've seen, it seems like the convention for escaping html on user-entered content (for the purposes of preventing XSS) is to do it when rendering content. Most templating languages seem to do it by default, and I've come across things like this stackoverflow answer arguing that this logic is the job of the presentation layer.
So my question is, why is this the case? To me it seems cleaner to escape on input (i.e. form or model validation) so you can work under the assumption that anything in the database is safe to display on a page, for the following reasons:
Variety of output formats - for a modern web app, you may be using a combination of server-side html rendering, a JavaScript web app using AJAX/JSON, and mobile app that receives JSON (and which may or may not have some webviews, which may be JavaScript apps or server-rendered html). So you have to deal with html escaping all over the place. But input will always get instantiated as a model (and validated) before being saved to db, and your models can all inherit from the same base class.
You already have to be careful about input to prevent code-injection attacks (granted this is usually abstracted to the ORM or db cursor, but still), so why not also worry about html escaping here so you don't have to worry about anything security-related on output?
I would love to hear the arguments as to why html escaping on page render is preferred
In addition to what has been written already:
Precisely because you have a variety of output formats, and you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for a HTML page or a text output (e.g. an email). Why should you force your client to unescape "Jack &amp; Jill" to get "Jack & Jill"?
You are corrupting your data by default.
When someone does a keyword search for 'amp', they get "Jack & Jill". Why? Because you've corrupted your data.
Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL, and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&amp;y=2.
It's simply the wrong layer to do it - HTML related stuff should not be mixed up with raw HTTP handling. The database shouldn't be storing things that are related to one possible output format.
XSS and SQL Injection are not the only security problems, there are issues for every output you deal with - such as filesystem (think extensions like '.php' that cause web servers to execute code) and SMTP (think newline characters), and any number of others. Thinking you can "deal with security on input and then forget about it" decreases security. Rather you should be delegating escaping to specific backends that don't trust their input data.
You shouldn't be doing HTML escaping "all over the place". You should be doing it exactly once for every output that needs it - just like with any escaping for any backend. For SQL, you should be doing SQL escaping once, same goes for SMTP etc. Usually, you won't be doing any escaping - you'll be using a library that handles it for you.
If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.
Doing escaping at the form/http input level doesn't ensure safety, because nothing guarantees that data doesn't get into your database or system from another route. You've got to manually ensure that all inputs to your system are applying HTML escaping.
You may say that you don't have other inputs, but what if your system grows? It's often too late to go back and change your decision, because by this time you've got a ton of data, and may have compatibility with external interfaces e.g. public APIs to worry about, which are all expecting the data to be HTML escaped.
Even web inputs to the system are not safe, because often you have another layer of encoding applied e.g. you might need base64 encoded input in some entry point. Your automatic HTML escaping will miss any HTML encoded within that data. So you will have to do HTML escaping again, and remember to do, and keep track of where you have done it.
I've expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
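A small sketch of two of the points above, in Python and with made-up values: the raw value is stored once and escaped separately for each output, and a URL that was HTML-escaped on input no longer parses as intended. The parsing result assumes a recent Python 3 (3.10 or later), where parse_qs splits only on &.

```python
import html
import json
from urllib.parse import parse_qs, urlsplit

raw_name = "Jack & Jill"
raw_url = "http://example.com/?x=1&y=2"

# Escape per output format, at the point of output.
html_fragment = f"<li>{html.escape(raw_name)}</li>"
json_body = json.dumps({"name": raw_name})

# What happens if the URL had been HTML-escaped before storage instead.
stored_escaped_url = html.escape(raw_url)
good = parse_qs(urlsplit(raw_url).query)
bad = parse_qs(urlsplit(stored_escaped_url).query)

print(html_fragment)
print(json_body)
print(good)
print(bad)
```

The JSON client gets the untouched name, the HTML page gets the escaped one, and only the unescaped URL yields a y parameter.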
The original misconception
Do not confuse sanitization of output with validation.
While <script>alert(1);</script> is a perfectly valid username, it definitely must be escaped before showing on the website.
And yes, there is such a thing as "presentation logic", which is not related to "domain business logic". And said presentation logic is what presentation layer deals with. And the View instances in particular. In a well written MVC, Views are full-blown objects (contrary to what RoR would try to tell you), which, when applied in web context, juggle multiple templates.
About your reasons
Different output formats should be handled by different views. The rules and restrictions, which govern HTML, XML, JSON and other formats, are different in each case.
You always need to store the original input (sanitized to avoid injections, if you are not using prepared statements), because someone might need to edit it at some point.
And storing both the original and the XSS-safe "public" version is a waste. If you want to store sanitized output because it takes too many resources to sanitize it each time, then you are already barking up the wrong tree. This is a case where you use a cache instead of polluting the database.

Validating FirstName in a web application

I do not want to be too strict as there may be thousands of possible characters in a possible first name
Normal English alphabets, accented letters, non-English letters, numbers(??), common punctuation symbols
e.g.
D'souza
D'Anza
M.D. Shah (dots and space)
Al-Rashid
Jatin "Tom" Shah
However, I do not want to accept HTML tags, semicolons etc
Is there a list of such characters which is absolutely bad from a web application perspective
I can then use RegEx to blacklist these characters
Background on my application
It is a Java Servlet-JSP based web app.
Tomcat on Linux with MySQL (and sometimes MongoDB) as a backend
What I have tried so far
String regex = "[^<>~##$%;]*";
if(!fname.matches(regex))
throw new InputValidationException("Invalid FirstName");
My question is more on the design than coding ... I am looking for a exhaustive (well to a good degree of exhaustiveness) list of characters that I should blacklist
A better approach is to accept anything anyone wants to enter and then escape any problematic characters in the context where they might cause a problem.
For instance, there's no reason to prohibit people from using <i> in their names (although it might be highly unlikely that it's a legit name), and it only poses a potential problem (XSS) when you are generating HTML for your users. Similarly, disallowing quotes, semi-colons, etc. only makes sense in other scenarios (SQL queries, etc.). If the rules are different in different places and you want to sanitize input, then you need all the rules in the same place (what about whitespace? Are you going to create filenames including the user's first name? If so, maybe you'll have to add that to the blacklist).
Assume that you are going to get it wrong in at least one case: maybe there is something you haven't considered for your first implementation, so you go back and add the new item(s) to your blacklist. You still have users who have already registered with tainted data. So, you can either run through your entire database sanitizing the data (which could take a very very long time), or you can just do what you really have to do anyway: sanitize data as it is being presented for the current medium. That way, you only have to manage the sanitization at the relevant points (no need to protect HTML output from SQL injection attacks) and it will work for all your data, not just data you collect after you implement your blacklist.
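The "accept anything, escape per context" approach can be sketched as follows. The question is about Java/JSP, so this Python version is only an illustration of the idea; the table name, column name and sample value are invented. The database layer uses a parameterized query, and the HTML layer escapes at render time, so neither needs a character blacklist.

```python
import html
import sqlite3

# A name containing quotes and a tag, stored exactly as entered.
first_name = 'Jatin "Tom" D\'souza <i>'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT)")

# SQL context: a placeholder keeps the value out of the statement text.
conn.execute("INSERT INTO users (first_name) VALUES (?)", (first_name,))
stored = conn.execute("SELECT first_name FROM users").fetchone()[0]

# HTML context: escape only when generating markup.
greeting = f"<p>Hello, {html.escape(stored)}!</p>"

print(stored)
print(greeting)
```

The stored value round-trips unchanged, and the tag is neutralized only in the HTML output where it matters.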

How to remove special characters in WordPress?

I am using Topsy, which returns the title of the highest-ranking article on my website. It returns an RSS file which contains post titles with their links. For now I am only taking the post title and trying to search for it in the MySQL database using a function like this:
get_post_by_title($postTitle,'post');
But the problem is that Topsy also changes some characters in the RSS file; for example, " ' " is replaced with " ’ ". Because of this, the get_post_by_title() function does not return the post for that title.
EDIT : It returns me one post title like this :
iPad Applications In Bloom’s Taxonomy NEXT
Here the single quote is the special character.
Please help me. Thanks
First let's clear up a misconception: that character in your example is not a "special" character. It is Unicode code point U+2019, "RIGHT SINGLE QUOTATION MARK." Its HTML entity reference is &rsquo;. It's an ordinary character - it just happens to be an ordinary character that has no representation in ASCII. Before getting to an answer to your specific question, I need to tell you to read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" - it is just what it says on the tin, and unless you absorb at least a little more knowledge about Unicode, you will keep running into problems like this. Don't fret too much: everyone runs into problems like this until they learn how to deal with text. Unicode isn't "hard" so much as it is "prone to exposing unconscious assumptions we make about how text works." †
Now, to your question.
If I'm reading you right, what's happening to you is that you have posts with non-ASCII characters in their titles such as ’ which aren't showing up when you search for them with get_post_by_title() (it seems like you're using something similar to the accepted answer on this question - is that right?) There are two paths to a solution: store the titles in a format that's easier for you to search, or use a searching method that can find non-ASCII characters.
Storing the titles differently would require that you run them through PHP's built-in htmlentities() function or before storing them in your Wordpress DB - you would also want to make sure that you convert characters with no HTML entity equivalent to '\xNN' form, and to make sure that your DB's collation/charset is set to UTF-8 or another Unicode-aware encoding. This will be a nontrivial amount of effort. ‡
Using a different searching method doesn't require tinkering with your DB or digging into WordPress internals, but it does require very careful fiddling with search string. You'll need to either use the exact character you're looking for in a search, expressed as a '\xNN' character reference if necessary, or use wildcards carefully in the search.
Either way, good luck. It may be possible to offer more specific advice if more of your code is visible.
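To make the point about U+2019 concrete, here is a sketch in Python (the question is about PHP and WordPress, so this only illustrates the idea, not a drop-in fix). The titles are sample values, and comparable and QUOTE_FOLD are hypothetical names; folding typographic quotes to ASCII for comparison is one possible matching strategy assumed for this example, not something the answer above prescribes.

```python
import html
import unicodedata

# Sample titles: one as an RSS feed might deliver it, one as it might be stored.
rss_title = "iPad Applications In Bloom\u2019s Taxonomy"
db_title = "iPad Applications In Bloom's Taxonomy"

# The "special" character is an ordinary, named Unicode code point.
print(unicodedata.name("\u2019"))
print(html.unescape("&rsquo;") == "\u2019")

# Assumed strategy: map typographic quotes to ASCII before comparing.
QUOTE_FOLD = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def comparable(title: str) -> str:
    """Hypothetical helper: decode entities, fold quotes, ignore case."""
    return html.unescape(title).translate(QUOTE_FOLD).casefold()

print(rss_title == db_title)
print(comparable(rss_title) == comparable(db_title))
```

The raw strings differ, but they compare equal once both sides are normalized the same way before the lookup.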
†: By the way, your life with regards to Unicode will also get much, much easier if you use better languages than PHP and better databases than MySQL. WordPress is inextricably tied to PHP and MySQL: PHP & MySQL are both woefully, horrendously, hilariously bad at handling Unicode issues correctly. Your life as a programmer will get better if you extirpate PHP & MySQL from it.
‡: Seriously, PHP is atrociously bad at this, and MySQL is in a shoelaces-tied-together state of fumbling. Avoid them.
remove from wp-config.php
//define('DB_CHARSET', 'utf8');
//define('DB_COLLATE','utf8_unicode_ci');
You can easily remove special characters using preg_replace, see this post -> http://code-tricks.com/filter-non-ascii-characters-using-php/
