I'm going to use PHP in my example, but my question applies to any programming language.
Whenever I deal with forms that can be filled out by users who are not logged in (in other words, untrusted users), I do several things to make sure it is safe to store in the database:
Verify that all of the expected fields are present in $_POST (none were removed using a tool such as Firebug)
Verify that there are no unexpected fields in $_POST. This way, a field in the database doesn't accidentally get written over.
Verify that all of the expected fields are of the expected type (almost always "string"). This way, problems don't come up if a malicious user is tinkering with the code and adds "[]" to the end of a field name, thus making PHP consider the field to be an array and then performing checks on it as though it were a string.
Verify that all of the required fields were filled out.
Verify that all of the fields (both required and optional) were filled out correctly (for example, email addresses and phone numbers are in the expected format).
Related to the previous item, but worthy of being its own item: verify that fields that are dropdown menus were submitted with values that are actually in the dropdown menu. Again, a user could tinker with the code and change the dropdown menu to be anything they want.
Sanitize all fields just in case the user intentionally or unintentionally included malicious code.
I don't believe that any of the above things are overkill because, as I mentioned, the user filling out the form is not trusted.
When it comes to admin backends, however, I'm not sure all of those things are necessary. These are the things that I still consider to be necessary:
Verify that all of the required fields were filled out.
Verify that all of the fields (both required and optional) were filled out correctly (for example, email addresses and phone numbers are in the expected format).
Sanitize all fields just in case the user intentionally or unintentionally included malicious code.
I'm considering dropping the remaining items in order to save time and have less code (and, therefore, more readable code). Is that a reasonable thing to do or is it worthwhile to treat all forms equally regardless of whether or not they are being filled out by a trusted user?
These are the only two reasons I can think of for why it might be wise to treat all forms equally:
The trusted user's credentials might be found out by an untrusted user.
The trusted user's machine could be infected with malware that messes with forms. I have never heard of such malware and doubt that this is something to be really be worried about, but it is something to consider anyway.
Thanks!
Without knowing all the details, it's hard to say.
However, in general this feels like a situation where code re-use should be possible. In other words, it feels like this boiler-plate form validation shouldn't need to be re-written for each unique form. Instead, I would aim to create some reusable external class that could be used for any form.
You mentioned PHP and there are already lots of form validation classes available:
http://www.google.com/search?gcx=w&sourceid=chrome&ie=UTF-8&q=form+validation+php+class
Best of luck!
Related
I am writing an app that has a sign-up form. This article made me doubt everything I knew about human names. My question is: does a person's name necessarily have positive length? Or can I validate names in this way and be confident that I have not denied anyone their identity?
P.S.: one might ask why am I validating at all. The answer is that this is for a school project and proper validation is a part of the mark. The article above proves that person's name can be pretty much any string of positive length but I don't know if zero length is OK.
With all types of programming, you have to draw a distinction between what is meaningful in the real world, and what is meaningful for your software solution.
How the data is to be used will validate what type of validation is required.
For instance, if your software interfaces with a government API, and the government API requires a first name and surname, you should do the same.
If you're interacting with bank accounts, you may have a single string which represents that account name, which many or may not be a human name or not, but may have other constraints around length.
If the name is only to be used for display purposes, maybe there is no point to capture the name at all, and instead you should capture a preferred display name (which doesn't needlessly assume a certain number of name components).
When writing software, you should target to make as few assumptions as possible, unless those assumptions will cause an increase in complexity of your software solution. If the software requires people to have non-empty names, then you should validate at the border that this is true.
In addition, if you were my student, you would have already lost marks for conflating null, and an empty string. In this instance, null would represent you lack data about the name, and an empty string would indicate that user has specified that their name is empty.
Also, if you decide not to validate something, you should at least leave a comment to indicate that you thought of it. If you do something unusual, it's possible a future developer may come along and fix the "bug". In addition, this helps you avoid losing marks.
From everything I've seen, it seems like the convention for escaping html on user-entered content (for the purposes of preventing XSS) is to do it when rendering content. Most templating languages seem to do it by default, and I've come across things like this stackoverflow answer arguing that this logic is the job of the presentation layer.
So my question is, why is this the case? To me it seems cleaner to escape on input (i.e. form or model validation) so you can work under the assumption that anything in the database is safe to display on a page, for the following reasons:
Variety of output formats - for a modern web app, you may be using a combination of server-side html rendering, a JavaScript web app using AJAX/JSON, and mobile app that receives JSON (and which may or may not have some webviews, which may be JavaScript apps or server-rendered html). So you have to deal with html escaping all over the place. But input will always get instantiated as a model (and validated) before being saved to db, and your models can all inherit from the same base class.
You already have to be careful about input to prevent code-injection attacks (granted this is usually abstracted to the ORM or db cursor, but still), so why not also worry about html escaping here so you don't have to worry about anything security-related on output?
I would love to hear the arguments as to why html escaping on page render is preferred
In addition to what has been written already:
Precisely because you have a variety of output formats, and you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for a HTML page or a text output (e.g. an email). Why should you force your client to unescape "Jack & Jill" to get "Jack & Jill"?
You are corrupting your data by default.
When someone does a keyword search for 'amp', they get "Jack & Jill". Why? Because you've corrupted your data.
Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL, and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&y=2.
It's simply the wrong layer to do it - HTML related stuff should not be mixed up with raw HTTP handling. The database shouldn't be storing things that are related to one possible output format.
XSS and SQL Injection are not the only security problems, there are issues for every output you deal with - such as filesystem (think extensions like '.php' that cause web servers to execute code) and SMTP (think newline characters), and any number of others. Thinking you can "deal with security on input and then forget about it" decreases security. Rather you should be delegating escaping to specific backends that don't trust their input data.
You shouldn't be doing HTML escaping "all over the place". You should be doing it exactly once for every output that needs it - just like with any escaping for any backend. For SQL, you should be doing SQL escaping once, same goes for SMTP etc. Usually, you won't be doing any escaping - you'll be using a library that handles it for you.
If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.
Doing escaping at the form/http input level doesn't ensure safety, because nothing guarantees that data doesn't get into your database or system from another route. You've got to manually ensure that all inputs to your system are applying HTML escaping.
You may say that you don't have other inputs, but what if your system grows? It's often too late to go back and change your decision, because by this time you've got a ton of data, and may have compatibility with external interfaces e.g. public APIs to worry about, which are all expecting the data to be HTML escaped.
Even web inputs to the system are not safe, because often you have another layer of encoding applied e.g. you might need base64 encoded input in some entry point. Your automatic HTML escaping will miss any HTML encoded within that data. So you will have to do HTML escaping again, and remember to do, and keep track of where you have done it.
I've expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
The original misconception
Do not confuse sanitation of output with validation.
While <script>alert(1);</script> is a perfectly valid username, it definitely must be escaped before showing on the website.
And yes, there is such a thing as "presentation logic", which is not related to "domain business logic". And said presentation logic is what presentation layer deals with. And the View instances in particular. In a well written MVC, Views are full-blown objects (contrary to what RoR would try to to tell you), which, when applied in web context, juggle multiple templates.
About your reasons
Different output formats should be handled by different views. The rules and restrictions, which govern HTML, XML, JSON and other formats, are different in each case.
You always need to store the original input (sanitized to avoid injections, if you are not using prepared statements), because someone might need to edit it at some point.
And storing original and the xss-safe "public" version is waste. If you want to store sanitized output, because it takes too much resources to sanitize it each time, then you are already pissing at the wrong tree. This is a case, when you use cache, instead of polluting the database.
I have received requirements that ask to normalize text box content when the user changes the focus to another control on the same data input form. Example normalizations:
whitespace at the start and end of the input is trimmed
If the text box was made empty and this is not valid, replace the content of the text box with the default value
I have a feeling that this is not in line with good GUI design. I have read the Windows UX Guidelines for text boxes but I did not immediately find any relevant rules.
Is normalizing text box content in this way acceptable?
I have definitely seen this before (examples elude me right now) but I personally don't like it when the UI changes my input.
If the UI is smart enough to change my input on me then it should accept it as is and change the value when it needs to process it.
When the input changes auto-magically you are now forcing the user to stop and ask themselves why it changed and if they did something wrong or if the application has an error. Don't make the user think!
Generally, you should accept user input exactly has they entered it. Chances are users did it that way for a good reason. For example, imagine a user entering a foreign address, and then your app screws it up trying to format like a domestic address. At the very least, users entered the input the way they’re used to it being, so changing it can make it hard for them to cross-check it.
However, there are several exceptions:
Add defaults to incomplete input. Adding input the user left off (e.g., years to dates, units to dimensions) provides good feedback on how the app is interpreting the input that would otherwise be ambiguous. This also encourages the user to use defaults, making their input more efficient.
Resolve other ambiguities. Change to an unambiguous format if the user’s format is open to interpretation. For example, if you have international users, you may want to change “9-8-09” to “Sep 8 2009” (or “9 Aug 2009”) to provide feedback on what your app considers the month and day to be.
Add delimiters when none provided. Automagically adding standard or even arbitrary delimiters to long alphanumeric strings (e.g., phone numbers, credit card numbers, serial numbers) provides an input display that the users can crosscheck more easily. Sometimes users may enter a string without delimiters in order to go faster or because they are the victim of web abuse by sites that refuse to accept even standard delimiters.
Spelling, grammar, and capitalization correction. Users often appreciate this, but only if there’s also a means to override it. Some users like to use "i" as the first person pronoun.
If the field is used by more than one user, then you probably should automatically format the value in some standard way that accommodates the majority of your users, but that should be done when the value is stored on the backend, not when focus leaves the field. For example, if a user enters a time of 15:30 it should remain as 15:30 as long as the user views the page. However, the next time a user (any user) retrieves the data, it should appear as 3:30pm (if that’s how most of your users are used to seeing time).
Such backend formatting applies to trimming whitespace so that all users can search, find, and sort on the field consistently. It’s probably not a good idea to replace a blank value (or any invalid value) with the default because users are unlikely to anticipate getting that value. An exception would perhaps be changing blank to 0 for numeric fields in situations where obviously blank == none == zero, but again this probably should be done when storing in the backend, not in the field itself. If blank is ambiguous, (e.g., may mean 0 or may mean "I don't know") then the second bullet above applies, and you may want to autocorrect in the field when focus is lost.
Of course, if your users vary in how they need to have a data type formatted, then you can have different variants of the app that display the data type in different ways for different user groups, or you can make the format of the data type a user preference, but that’s really another issue.
If the user wants it, and the Stakeholder ask for it, then is perfectly safe.
Trimming is very common. and the replace is common when you are talking about filling textbox with numbers. (a 0 instead of a blank).
It's a fairly standard feature, especially the whitespace trimming. The default value replacement raises a larger flag just because it is less common.
I'm pretty sure that I've seen versions of Microsoft Office that do this - putting "pt." after a value in points, for instance. Microsoft's endorsement should be a good sign.
We have quite a few of these kind of requirement. The reason given for forcing a default value rather than a blank space is that it looks better in reports or if the client wants to see the live system. A blank looks a bit like "couldn't be bothered to enter anything". For a similar reason, we often upper-case the text for consistency as the users never use consistent formatting.
In the very common scenario where you have a textbox, and some kind of validation rule which constraints the valid entries on that textbox. How should a failing validation affect the content of the textbox especially in the case when there originally was a valid value entered before the invalid.
Example: Imagine a form where one can enter a number between 0 and 50. The user enteres 40. Everything is fine. But then the user goes in and changes it to 59.
Obviously IMHO the application should inform him about his mistake asap. But what to do with the values? I think there should be a way back to the 40 as an easy way to a valid state, but I'm not sure when and how to revert it: On focus lost? Only on a special key/ button press?
What do you think?
Edit: I absolutely agree with the first two answers: changing the input automatically is a bad idea. Yet I would like to keep the 'last valid' value available ... Maybe a clean UnDo feature would to the trick?
The approach that I take to field validation like this is pretty simple:
Thou Shalt Not Discard the Users Input
If the entry made by the user is invalid, flag it as such - but don't change the value.
The reasoning behind you is simple - consistency.
Consider, for example, a pair of fields used for entry of a date range where "valid" means that the start date is before the end date.
Now, the user wants to enter a completely new range.
If your system discards invalid entries immediately, you force your user into different behaviour. For entry of an earlier date range, the start date must be entered first; for entry of a later date range, the final date must be entered first. Unfriendly.
Instead, respect the users input - when the start date is entered, freely flag it as invalid, but leave the value in place. Then, when the final date is entered, both fields now validate.
This is also motivation for displaying validation dynamically (as values are changed), but for not limiting the users movement between fields.
Depends on you functional specs really, but yeah falling back to the previous correct value sounds like a good practice ... for a blocking error which will always be blocking.
However do you really want to block your user there? What if the valid range is 0-50 when option b is selected but becomes 0-60 when option c is selected? And the user decides to change the ranged value first? Then the user will be very frustrated at having lost what she considers a perfectly valid value...
Keep that in mind when automatically reverting a change made by a user :) The user may have made a mistake but may also have made the change intentionnaly because he associated it to another change in his mind but can't change both at the same time in the application...
Warn that the value is invalid and let ctrl-Z cancel the stuff the user put there may be a more sensible default.
#Bevan Thou art righteous.
If you want to see an example of how annoying it can be crack open Google Analytics
This unfriendly behaviour is exactly what Google Analytics does when you try to compare dates, and it drives me crazy.
If you enter an end date which is before your start date, it discards your entry and thus forces you to enter information in a prescribed sequence.
It also means that small typo's can mean you being forced to retype a whole date, which just sucks.
I agree with the other answers in not changing the user input, you cannot tell what about it may be wrong, perhaps it was a typo, an missing decimal, swapped day/month/year fields, etc. An undo option to allow them to revert back to previous might be an added nice feature however.
The main items in my view are:
- make it clear what the valid ranges and formats are BEFORE the user enteres the data in the form, via examples or other similar indicators.
- be sure to indicate mandatory fields clearly prior to submision.
- user controls that limit the user to entering valida data - date pickers, spin controls, numeric only controls, max lengths set on textboxes, etc.
- make it clear when validation fails as to what item(s) on the form are invalid and why they are invalid, not just a simple "data is incorrect" global message, espeically if you have a long form with many fields that could have validation issues.
When you're passing variables through your site using GET requests, do you validate (regular expressions, filters, etc.) them before you use them?
Say you have the URL http://www.example.com/i=45&p=custform. You know that "i" will always be an integer and "p" will always contain only letters and/or numbers. Is it worth the time to make sure that no one has attempted to manipulate the values and then resubmit the page?
Yes. Without a doubt. Never trust user input.
To improve the user experience, input fields can (and IMHO should) be validated on the client. This can pre-empt a round trip to the server that only leads to the same form and an error message.
However, input must always be validated on the server side since the user can just change the input data manually in the GET url or send crafted POST data.
In a worst case scenario you can end up with an SQL injection, or even worse, a XSS vulnerability.
Most frameworks already have some builtin way to clean the input, but even without this it's usually very easy to clean the input using a combination of regular exceptions and lookup tables.
Say you know it's an integer, use int.Parse or match it against the regex "^\d+$".
If it's a string and the choices are limited, make a dictionary and run the string through it. If you don't get a match change the string to a default.
If it's a user specified string, match it against a strict regex like "^\w+$"
As with any user input it is extremely important to check to make sure it is what you expect it is. So yes!
Yes, yes, and thrice yes.
Many web frameworks will do this for you of course, e.g., Struts 2.
One important reason is to check for sql injection.
So yes, always sanitize user input.
not just what the others are saying. Imagine a querystring variable called nc, which can be seen to have values of 10, 50 and 100 when the user selects 10, 50 and 100 results per page respectively. Now imagine someone changing this to 50000. If you are just checking that to be an integer, you will be showing 50000 results per page, affecting your pageviews, server loads, script times and so on. Plus this could be your entire database. When you have such rules (10, 50 or 100 results per page), you should additionaly check to see if the value of nr is 10, 50 or 100 only, and if not, set it to a default. This can simply be a min(nc, 100), so it will work if nc is changed to 25, 75 and so on, but will default to 100 if it sees anything above 100.
I want to stress how important this is. I know the first answer discussed SQL Injection and XSS Vulnerabilities. The latest rave in SQL Injection is passing a binary encoded SQL statement in query strings, which if it finds a SQL injection hole, it will add a http://reallybadsite.com'/> to every text field in your database.
As web developers we have to validate all input, and clean all the output.
Remember a hacker isn't going to use IE to compromise your site, so you can't rely on any validation in the web.
Yes, check them as thoroughly as you can. In PHP I always check the types (IsInt(i), IsString(p)).