Java ESAPI IMPLEMENTATION - filter

I have a question regarding the OWASP ESAPI interface for XSS protection. To keep it simple, short and straight: I'm doing a source code review using Fortify.
The application implements ESAPI, makes a call to ESAPI.encoder().canonicalize(userInput), does no further validation, and prints the output. Is this still vulnerable to XSS?
PS: The reflection point is inside an HTML element.
I've gone through all the posts regarding the ESAPI interface on Stack Overflow, but couldn't quite get it.
Any help would be appreciated.

canonicalize alone doesn't prevent XSS at all. It decodes data, but you want the opposite: to encode the data.
Not only does it let content like <script>alert(1)</script> straight through, but it also decodes content like &lt;script&gt;alert(1)&lt;/script&gt; from a non-executable form into an executable one.
The method you want instead is encodeForHTML. This will encode the data so it can be inserted safely into an HTML context, so < will become &lt; and so on.
Also, check whether you're already doing HTML encoding by testing how these characters come out. Some templating languages and tags do the encoding automatically.
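For illustration, a minimal sketch of the combination a reviewer would want to see, using the standard ESAPI Encoder calls (the class and method names here are mine, not from the application under review):

import org.owasp.esapi.ESAPI;

public class SafeOutput {
    public static String forHtmlBody(String userInput) {
        // Canonicalize first, so double- or mixed-encoded payloads are
        // reduced to their simplest form (and attacks can be detected).
        String canonical = ESAPI.encoder().canonicalize(userInput);
        // Then encode for the context the value is reflected into -
        // here, the body of an HTML element.
        return ESAPI.encoder().encodeForHTML(canonical);
    }
}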

Related

Purpose of web app input validation for security reasons

I often encounter advice for protecting a web application against a number of vulnerabilities, like SQL injection and other types of injection, by doing input validation.
It's sometimes even said to be the single most important technique.
Personally, I feel that input validation for security reasons is never necessary and better replaced with
if possible, not mixing user input with a programming language at all (e.g. using parameterized SQL statements instead of concatenating input in the query strings)
escaping user input before mixing it with a programming or markup language (e.g. html escaping, javascript escaping, ...)
Of course for a good UX it's best to catch input that would generate errors on the backend early, in the GUI, but that's another matter.
Am I missing something or is the only purpose to try to make up for mistakes against the above two rules?
Yes you are generally correct.
A piece of data is only dangerous when "used". And it is only dangerous if it has special meaning in the context it is used.
For example, <script> is only dangerous if used in output to an HTML page.
Robert'); DROP TABLE Students;-- is only dangerous when used in a database query.
Generally, you want to make this data "safe" as late as possible. Such as HTML encoding when output as HTML to an HTML page, and parameterised when inserting into a database. The big advantage of this is that when the data is later retrieved from these locations, it will be returned in its original, unsanitized format.
So if you have the value A&B O'Leary in an input field, it would be encoded like so:
<input type="hidden" value="A&amp;B O&#39;Leary" />
and if this is submitted to your application, your programming framework will automatically decode it for you back to A&B O'Leary. Same with your DB:
string name = "A&B O'Leary";
string sql = "INSERT INTO Customers (Name) VALUES (@Name)";
SqlCommand command = new SqlCommand(sql);
command.Parameters.AddWithValue("@Name", name);
Simples.
Additionally if you then need to give the user any output in plain text, you should retrieve it from your DB and spit it out. Or in JavaScript - you just JavaScript entity encode (although best avoided for complexity reasons - I find it easier to secure if I only output to HTML then read the values from the DOM).
If you'd HTML-encoded it early, then to output to JavaScript/JSON you'd first have to convert it back and then hex entity encode it. It will get messy, some developers will forget they have to decode first, and you will have stray &amp;s everywhere.
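To see how messy, here is a hypothetical snippet (not from the original answer) using Apache Commons Text's escapeHtml4 as a stand-in for whatever encoder is in play:

import org.apache.commons.text.StringEscapeUtils;

String name = "A&B O'Leary";
// Encoded "early", on input:
String stored = StringEscapeUtils.escapeHtml4(name);     // "A&amp;B O'Leary"
// Encoded again on output by a template that escapes by default:
String rendered = StringEscapeUtils.escapeHtml4(stored); // "A&amp;amp;B O'Leary"

Rendered in a browser, that last value shows up as A&amp;B O'Leary instead of A&B O'Leary.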
You can use validation as an additional defence, but it should not be the first port of call. For example, if you are validating a UK postcode you would want to whitelist the alphanumeric characters in upper and lower cases. Any other characters would be rejected or removed by your application. This can reduce the chances of SQLi or XSS occurring on your application, but this method falls down where you need inputs to include characters that have special meaning in your output context (", ', <, > etc.). For example, if Stack Overflow did not allow characters such as these, you would be preventing questions and answers from including code snippets, which would pretty much make the site useless.
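As a sketch of that kind of whitelist validation (the pattern below is deliberately simplified and does not capture the full UK postcode grammar):

import java.util.regex.Pattern;

public class PostcodeValidator {
    // Whitelist: letters and digits with one optional space, nothing else.
    private static final Pattern POSTCODE =
            Pattern.compile("[A-Za-z0-9]{2,4} ?[A-Za-z0-9]{3}");

    public static boolean isValid(String input) {
        // matches() anchors the pattern to the whole string.
        return input != null && POSTCODE.matcher(input).matches();
    }
}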
Not all SQL statements are parameterizable. For example, if you need to use dynamic identifiers (as opposed to literals). Even whitelisting can be hard, sometimes it needs to be dynamic.
Escaping XSS on output is a good idea. Until you forget to escape it on your admin dashboard too and they steal all your admin's cookies. Don't let XSS in your database.
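For the dynamic-identifier case, the usual workaround is a fixed lookup, since identifiers can't be bound as parameters. A sketch, with made-up column names:

import java.util.Map;

public class SortColumn {
    // Map user-supplied sort keys onto a closed set of known-safe
    // column names; anything else is rejected outright.
    private static final Map<String, String> ALLOWED =
            Map.of("name", "customer_name", "created", "created_at");

    public static String resolve(String requestedKey) {
        String column = ALLOWED.get(requestedKey);
        if (column == null) {
            throw new IllegalArgumentException("Unsupported sort key");
        }
        return column;
    }
}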

Security benefits of encoding HTML special characters in JSON responses

I have recently received a recommendation, from a third party, to encode HTML special characters in all server responses "for security reasons". So:
' --> &#39;
& --> &amp;
e.g.
{ "id": 1, "name": "Miles O&#39;Brien" }
Question: Is there a security gain in doing this, or is it just a paranoia?
& --> &amp;
Are you sure this was the kind of encoding they meant?
There is a reason to encode HTML-special characters returned inside JSON responses, and that's to avoid XSS caused by unwanted content-type sniffing. For example, if you had:
{ "name": "<body>Mister <script>...</script>" }
and an attacker included a link to your JSON-returning resource in an HTML context (eg iframe src) then a stupid browser might decide that, due to the giveaway string <body>, your document was not a JSON object but an HTML document. It could then execute the script in your security context, leading to XSS vulns.
The solution to this is to use JSON string literal escaping, for example:
{ "name": "\u003Cbody\u003EMister \u003Cscript\u003E...\u003C/script\u003E" }
Using HTML-escaping in this context, whilst it avoids the problem, has the side-effect of changing the meaning of the strings. "Miles O&#x27;Brien" read by a JSON parser is still Miles O&#x27;Brien with the ampersand-x-twenty-seven in it, so if you're writing that value to the page using the likes of .value, .textContent or jQuery .text(), it's going to look weird.
Now if you were assigning that string to .innerHTML or jQuery .html() instead, then yeah, you'd definitely need to HTML-escape it at some point, regardless of the JSON XSS problem. However I'd suggest that in this case, for separation-of-concerns reasons, that point should be at the client end where you're actually injecting the content into HTML markup, rather than the server side generating the JSON. In general it is better to avoid injecting strings into markup anyhow, when safer DOM-style methods are available.
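To make the JSON string-literal escaping concrete, here is a hand-rolled Java helper (my own sketch; serializers such as Gson apply equivalent \u003C-style escapes by default):

public class JsonHtmlSafe {
    // Replace HTML-significant characters with JSON \uXXXX escapes.
    // A JSON parser still yields the original characters, so the
    // string's meaning is unchanged - unlike HTML-entity encoding.
    public static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            switch (c) {
                case '<': sb.append("\\u003C"); break;
                case '>': sb.append("\\u003E"); break;
                case '&': sb.append("\\u0026"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }
}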
Depending on what you are using the data for, yes there is a security benefit.
If you were taking user input, sending it back to your server, and then using it to interact with your database, I could potentially terminate one of your strings and inject my own SQL statements. And even without a malicious mindset, sending quotation characters around could accidentally terminate strings.
There seems to be a tendency to protect novice/stupid/naive developers from creating XSS holes in their sites. Especially when someone else is going to deal with these responses (e.g. an open API, some junior developer on your team), they might forget to properly HTML-encode the strings before feeding them to some $('#myelement').html() call. The idea is that escaping these responses on the server will result in double escaping (worst case) for developers who don't understand escaping, whereas "smart" developers will know when to unescape the values before using them. The alternative being that "less smart" developers will create a site filled with XSS security holes.
Personally I'm not a big fan of this tendency, but I certainly see how it will result in a safer internet overall, especially as web development is more and more being practiced as a hobby.... What you choose to do is up to you, but this is the rationale behind the request to html-escape all strings in JSON.
Examples of others doing this:
Spotify
Open Social (I don't have a link, I'm sure you'll take my word for it)

HTML/XSS escape on input vs output

From everything I've seen, it seems like the convention for escaping html on user-entered content (for the purposes of preventing XSS) is to do it when rendering content. Most templating languages seem to do it by default, and I've come across things like this stackoverflow answer arguing that this logic is the job of the presentation layer.
So my question is, why is this the case? To me it seems cleaner to escape on input (i.e. form or model validation) so you can work under the assumption that anything in the database is safe to display on a page, for the following reasons:
Variety of output formats - for a modern web app, you may be using a combination of server-side html rendering, a JavaScript web app using AJAX/JSON, and mobile app that receives JSON (and which may or may not have some webviews, which may be JavaScript apps or server-rendered html). So you have to deal with html escaping all over the place. But input will always get instantiated as a model (and validated) before being saved to db, and your models can all inherit from the same base class.
You already have to be careful about input to prevent code-injection attacks (granted this is usually abstracted to the ORM or db cursor, but still), so why not also worry about html escaping here so you don't have to worry about anything security-related on output?
I would love to hear the arguments as to why html escaping on page render is preferred
In addition to what has been written already:
Precisely because you have a variety of output formats, and you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for an HTML page or a text output (e.g. an email). Why should you force your client to unescape "Jack &amp; Jill" to get "Jack & Jill"?
You are corrupting your data by default.
When someone does a keyword search for 'amp', they get back "Jack & Jill" - because the stored value is actually "Jack &amp; Jill". Why? Because you've corrupted your data.
Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL, and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&amp;y=2, which no longer has a y parameter (it has amp;y instead).
It's simply the wrong layer to do it - HTML related stuff should not be mixed up with raw HTTP handling. The database shouldn't be storing things that are related to one possible output format.
XSS and SQL Injection are not the only security problems, there are issues for every output you deal with - such as filesystem (think extensions like '.php' that cause web servers to execute code) and SMTP (think newline characters), and any number of others. Thinking you can "deal with security on input and then forget about it" decreases security. Rather you should be delegating escaping to specific backends that don't trust their input data.
You shouldn't be doing HTML escaping "all over the place". You should be doing it exactly once for every output that needs it - just like with any escaping for any backend. For SQL, you should be doing SQL escaping once, same goes for SMTP etc. Usually, you won't be doing any escaping - you'll be using a library that handles it for you.
If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.
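For instance, in the JSP world that comes up later on this page, JSTL's c:out escapes HTML special characters by default (escapeXml is true unless you turn it off); the bean path below is a made-up example:

<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<%-- c:out encodes <, >, &, ' and " in ${user.name} automatically --%>
<p>Hello, <c:out value="${user.name}"/></p>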
Doing escaping at the form/http input level doesn't ensure safety, because nothing guarantees that data doesn't get into your database or system from another route. You've got to manually ensure that all inputs to your system are applying HTML escaping.
You may say that you don't have other inputs, but what if your system grows? It's often too late to go back and change your decision, because by this time you've got a ton of data, and may have compatibility with external interfaces e.g. public APIs to worry about, which are all expecting the data to be HTML escaped.
Even web inputs to the system are not safe, because often you have another layer of encoding applied, e.g. you might need base64 encoded input at some entry point. Your automatic HTML escaping will miss any HTML wrapped inside that data. So you will have to do HTML escaping again, remember to do it, and keep track of where you have done it.
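A hypothetical servlet fragment to make that layered-encoding point concrete (the parameter name is invented, and request is the usual HttpServletRequest):

import java.util.Base64;
import java.nio.charset.StandardCharsets;

// An input filter that HTML-escapes request parameters never sees the
// markup below, because it arrives wrapped in base64:
String wrapped = request.getParameter("payload");
byte[] raw = Base64.getDecoder().decode(wrapped);
String decoded = new String(raw, StandardCharsets.UTF_8);
// decoded may now contain "<script>..." and has never been escaped.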
I've expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
The original misconception
Do not confuse sanitation of output with validation.
While <script>alert(1);</script> is a perfectly valid username, it definitely must be escaped before showing on the website.
And yes, there is such a thing as "presentation logic", which is not related to "domain business logic". And said presentation logic is what the presentation layer deals with, and the View instances in particular. In a well written MVC, Views are full-blown objects (contrary to what RoR would try to tell you), which, when applied in a web context, juggle multiple templates.
About your reasons
Different output formats should be handled by different views. The rules and restrictions, which govern HTML, XML, JSON and other formats, are different in each case.
You always need to store the original input (sanitized to avoid injections, if you are not using prepared statements), because someone might need to edit it at some point.
And storing both the original and the XSS-safe "public" version is a waste. If you want to store sanitized output because it takes too many resources to sanitize it each time, then you are barking up the wrong tree. This is a case for using a cache, instead of polluting the database.

ruby string encoding

So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some other strings I want that really do not work at being converted to latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
Since your pages are XHTML, they could also embed the encoding information in an XML processing instruction, but again, they're not required to. It also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
So, the issue is that ANN only specifies the encoding via headers, and Nokogiri doesn't receive the headers from the open() function. So Nokogiri guesses that the page is latin-encoded, and produces strings that we can't really reverse to recover the original characters.
You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.
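For anyone landing here later, that call looks like this in Ruby (the URL is a placeholder; the encoding is the third positional argument to Nokogiri::HTML):

require 'nokogiri'
require 'open-uri'

# Force the parser to treat the document as UTF-8 instead of guessing.
html = URI.open('http://example.com/page.html')
doc = Nokogiri::HTML(html, nil, 'UTF-8')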

HTTP response splitting

I'm trying to handle this possible exploit and wondering what the best way to do it is. Should I use Apache Commons Validator, create a list of known allowed symbols, and validate against that?
From the wikipedia article:
The generic solution is to URL-encode strings before inclusion into HTTP headers such as Location or Set-Cookie.
Typical examples of sanitization include casting to integer, or aggressive regular expression replacement. It is worth noting that although this is not a PHP specific problem, the PHP interpreter contains protection against this attack since version 4.4.2 and 5.1.2.
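In Java terms, that generic URL-encoding fix looks something like this (a sketch; the parameter and path names are made up, and request/response are the usual servlet objects):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Percent-encoding turns CR and LF into harmless %0D and %0A before
// the user-supplied value reaches a header:
String target = URLEncoder.encode(request.getParameter("next"), StandardCharsets.UTF_8);
response.setHeader("Location", "/redirect?to=" + target);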
Edit
I'm tied into using JSPs with Java actions!
There don't appear to be any JSP-specific protections for this attack vector - many descriptions on the web assume ASP or PHP - but this link describes a fairly platform-neutral way to approach the problem (JSP is used as an incidental example in it).
Basically your first step is to identify the potentially hazardous characters (CRs, LFs, etc.) and then to remove them. I'm afraid this is about as robust a solution as you can hope for!
Solution
Validate input. Remove CRs and LFs (and all other hazardous characters) before embedding data into any HTTP response headers, particularly when setting cookies and redirecting. It is possible to use third party products to defend against CR/LF injection, and to test for existence of such security holes before application deployment.
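A minimal version of that stripping step in Java (the helper name is mine):

public class HeaderSafe {
    // Remove CR and LF so a user-supplied value can't terminate the
    // current header and inject new headers or a second response.
    public static String strip(String headerValue) {
        return headerValue == null ? "" : headerValue.replaceAll("[\\r\\n]+", "");
    }
}

Anything you embed in Location or Set-Cookie would go through this first.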
Use PHP? ;)
According to Wikipedia and the PHP CHANGELOG, PHP's had protection against it in PHP4 since 4.4.2 and PHP5 since 5.1.2.
Only skimmed it -- but, this might help. His examples are written in JSP.
OK, well, casting to an int is not much use when reading strings. Also, using regex in every action which receives input from the browser could be messy. I'm looking for a more robust solution.
