I do not want to be too strict as there may be thousands of possible characters in a possible first name
Normal english alphabets, accented letters, non english letters, numbers(??), common punctuation synbols
e.g.
D'souza
D'Anza
M.D. Shah (dots and space)
Al-Rashid
Jatin "Tom" Shah
However, I do not want to except HTML tags, semicolons etc
Is there a list of such characters which is absolutely bad from a web application perspective
I can then use RegEx to blacklist these characters
Background on my application
It is a Java Servlet-JSP based web app.
Tomcat on Linux with MySQL (and sometimes MongoDB) as a backend
What I have tried so far
String regex = "[^<>~##$%;]*";
if(!fname.matches(regex))
throw new InputValidationException("Invalid FirstName")
My question is more on the design than coding ... I am looking for a exhaustive (well to a good degree of exhaustiveness) list of characters that I should blacklist
A better approach is to accept anything anyone wants to enter and then escape any problematic characters in the context where they might cause a problem.
For instance, there's no reason to prohibit people from using <i> in their names (although it might be highly unlikely that it's a legit name), and it only poses a potential problem (XSS) when you are generating HTML for your users. Similarly, disallowing quotes, semi-colons, etc. only make sense in other scenarios (SQL queries, etc.). If the rules are different in different places and you want to sanitize input, then you need all the rules in the same place (what about whitespace? Are you gong to create filenames including the user's first name? If so, maybe you'll have to add that to the blacklist).
Assume that you are going to get it wrong in at least one case: maybe there is something you haven't considered for your first implementation, so you go back and add the new item(s) to your blacklist. You still have users who have already registered with tainted data. So, you can either run through your entire database sanitizing the data (which could take a very very long time), or you can just do what you really have to do anyway: sanitize data as it is being presented for the current medium. That way, you only have to manage the sanitization at the relevant points (no need to protect HTML output from SQL injection attacks) and it will work for all your data, not just data you collect after you implement your blacklist.
Related
I'm currently working on an internationalisation project for a large web application - initially we're just implementing French but more languages will follow in time. One of the issues we've come across is how to display adjectives.
Let's take "Active" as an example. When we received translations back from the company we're using, they returned "Actif(ve)", as English "Active" translates to masculine "Actif" or feminine "Active". We're unsure of how to display this, and wondered if there are any well established conventions in the web development world.
As far as I see it there are three possible scenarios:
We know at development time which noun a given adjective is referring to. In this case we can determine and use the correct gender.
We're referring to a user, either directly ("you") or in the third person. Short of making every user have a gender, I don't see a better approach than displaying both, i.e. "Actif(ve)"
We are displaying the adjective in isolation, not knowing which noun it's referring to. For example in a table of data, some rows might be dealing with a masculine entity, some feminine.
Scenarios 2 and 3 seem to be the toughest ones. Does anyone have any experience handling these issues? Any tips would be appreciated!
This is complex, because we cannot imagine all the cases, and there is risk to go in "opinion based" answer, so I keep it short and generic.
Usually I prefer to give context in translation (for translator), e.g. providing template: _("active {user_name}" (so also the ordering will be correct if languages want different ordering).
Then you may need to change code and template into _("active {first_name_feminine}") and _("active {first_name_masculine}") (and possibly more for duals, trials, plurals, collectives, honorific, etc.). Note: check that the translator will not mangle the {} and the string inside. Usually you need specific export/import scripts. Or I add a note inside the string, and I quickly translate into English removing the note to the translator). Also this can be automated (be creative on using special Unicode characters which should not be used in normal text, to delimit such text).
But if you cannot know the gender, the Actif(ve) may be the polite version used in such language. You need a native speaker test, and changes back and forth.
I often encounter advice for protecting a web application against a number of vulnerabilities, like SQL injection and other types of injection, by doing input validation.
It's sometimes even said to be the single most important technique.
Personally, I feel that input validation for security reasons is never necessary and better replaced with
if possible, not mixing user input with a programming language at all (e.g. using parameterized SQL statements instead of concatenating input in the query strings)
escaping user input before mixing it with a programming or markup language (e.g. html escaping, javascript escaping, ...)
Of course for a good UX it's best to catch input that would generate errors on the backand early in the GUI, but that's another matter.
Am I missing something or is the only purpose to try to make up for mistakes against the above two rules?
Yes you are generally correct.
A piece of data is only dangerous when "used". And it is only dangerous if it has special meaning in the context it is used.
For example, <script> is only dangerous if used in output to an HTML page.
Robert'); DROP TABLE Students;-- is only dangerous when used in a database query.
Generally, you want to make this data "safe" as late as possible. Such as HTML encoding when output as HTML to an HTML page, and parameterised when inserting into a database. The big advantage of this is that when the data is later retrieved from these locations, it will be returned in its original, unsanitized format.
So if you have the value A&B O'Leary in an input field, it would be encoded like so:
<input type="hidden" value="A& O'Leary" />
and if this is submitted to your application, your programming framework will automatically decode it for you back to A&B O'Leary. Same with your DB:
string name = "A&B O'Leary";
string sql = "INSERT INTO Customers (Name) VALUES (#Name)";
SqlCommand command = new SqlCommand(sql);
command.Parameters.Add("#Name", name];
Simples.
Additionally if you then need to give the user any output in plain text, you should retrieve it from your DB and spit it out. Or in JavaScript - you just JavaScript entity encode (although best avoided for complexity reasons - I find it easier to secure if I only output to HTML then read the values from the DOM).
If you'd HTML encoded it early, then to output to JavaScript/JSON you'd first have to convert it back then hex entity encode it. It will get messy and some developers will forget they have to decode first and you will have &s everywhere.
You can use validation as an additional defence, but it should not be the first port of call. For example, if you are validating a UK postcode you would want to whitelist the alphanumeric characters in upper and lower cases. Any other characters would be rejected or removed by your application. This can reduce the chances of SQLi or XSS occurring on your application, but this method falls down where you need inputs to include characters that have special meaning to your output context (" '<> etc). For example, on Stack Overflow if they did not allow characters such as these you would be preventing questions and answers from including code snippets which would pretty much make the site useless.
Not all SQL statements are parameterizable. For example, if you need to use dynamic identifiers (as opposed to literals). Even whitelisting can be hard, sometimes it needs to be dynamic.
Escaping XSS on output is a good idea. Until you forget to escape it on your admin dashboard too and they steal all your admin's cookies. Don't let XSS in your database.
From everything I've seen, it seems like the convention for escaping html on user-entered content (for the purposes of preventing XSS) is to do it when rendering content. Most templating languages seem to do it by default, and I've come across things like this stackoverflow answer arguing that this logic is the job of the presentation layer.
So my question is, why is this the case? To me it seems cleaner to escape on input (i.e. form or model validation) so you can work under the assumption that anything in the database is safe to display on a page, for the following reasons:
Variety of output formats - for a modern web app, you may be using a combination of server-side html rendering, a JavaScript web app using AJAX/JSON, and mobile app that receives JSON (and which may or may not have some webviews, which may be JavaScript apps or server-rendered html). So you have to deal with html escaping all over the place. But input will always get instantiated as a model (and validated) before being saved to db, and your models can all inherit from the same base class.
You already have to be careful about input to prevent code-injection attacks (granted this is usually abstracted to the ORM or db cursor, but still), so why not also worry about html escaping here so you don't have to worry about anything security-related on output?
I would love to hear the arguments as to why html escaping on page render is preferred
In addition to what has been written already:
Precisely because you have a variety of output formats, and you cannot guarantee that all of them will need HTML escaping. If you are serving data over a JSON API, you have no idea whether the client needs it for a HTML page or a text output (e.g. an email). Why should you force your client to unescape "Jack & Jill" to get "Jack & Jill"?
You are corrupting your data by default.
When someone does a keyword search for 'amp', they get "Jack & Jill". Why? Because you've corrupted your data.
Suppose one of the inputs is a URL: http://example.com/?x=1&y=2. You want to parse this URL, and extract the y parameter if it exists. This silently fails, because your URL has been corrupted into http://example.com/?x=1&y=2.
It's simply the wrong layer to do it - HTML related stuff should not be mixed up with raw HTTP handling. The database shouldn't be storing things that are related to one possible output format.
XSS and SQL Injection are not the only security problems, there are issues for every output you deal with - such as filesystem (think extensions like '.php' that cause web servers to execute code) and SMTP (think newline characters), and any number of others. Thinking you can "deal with security on input and then forget about it" decreases security. Rather you should be delegating escaping to specific backends that don't trust their input data.
You shouldn't be doing HTML escaping "all over the place". You should be doing it exactly once for every output that needs it - just like with any escaping for any backend. For SQL, you should be doing SQL escaping once, same goes for SMTP etc. Usually, you won't be doing any escaping - you'll be using a library that handles it for you.
If you are using sensible frameworks/libraries, this is not hard. I never manually apply SQL/SMTP/HTML escaping in my web apps, and I never have XSS/SQL injection vulnerabilities. If your method of building web pages requires you to remember to apply escaping, or end up with a vulnerability, you are doing it wrong.
Doing escaping at the form/http input level doesn't ensure safety, because nothing guarantees that data doesn't get into your database or system from another route. You've got to manually ensure that all inputs to your system are applying HTML escaping.
You may say that you don't have other inputs, but what if your system grows? It's often too late to go back and change your decision, because by this time you've got a ton of data, and may have compatibility with external interfaces e.g. public APIs to worry about, which are all expecting the data to be HTML escaped.
Even web inputs to the system are not safe, because often you have another layer of encoding applied e.g. you might need base64 encoded input in some entry point. Your automatic HTML escaping will miss any HTML encoded within that data. So you will have to do HTML escaping again, and remember to do, and keep track of where you have done it.
I've expanded on these here: http://lukeplant.me.uk/blog/posts/why-escape-on-input-is-a-bad-idea/
The original misconception
Do not confuse sanitation of output with validation.
While <script>alert(1);</script> is a perfectly valid username, it definitely must be escaped before showing on the website.
And yes, there is such a thing as "presentation logic", which is not related to "domain business logic". And said presentation logic is what presentation layer deals with. And the View instances in particular. In a well written MVC, Views are full-blown objects (contrary to what RoR would try to to tell you), which, when applied in web context, juggle multiple templates.
About your reasons
Different output formats should be handled by different views. The rules and restrictions, which govern HTML, XML, JSON and other formats, are different in each case.
You always need to store the original input (sanitized to avoid injections, if you are not using prepared statements), because someone might need to edit it at some point.
And storing original and the xss-safe "public" version is waste. If you want to store sanitized output, because it takes too much resources to sanitize it each time, then you are already pissing at the wrong tree. This is a case, when you use cache, instead of polluting the database.
I'm working on a Web application that uses tags like Stackoverflow tags. I've noticed that a lot of sites that use tags make them space-delimited, which disallows a tags like...
"favorite recipes"
Instead they enforce this...
"favorite-recipes" | "favorite_recipes" | "FavoriteRecipes"
If the tags were comma-delimited, an item could have a set of tags like...
"cats, birds, favorite recipes, horses"
I have to decide on the policy for my app.
I guess I like the idea of space-delimited, but if my users aren't programmers they might be more comfortable with the familiar idea of commas denoting a list.
Why are comma-delimited tags unusual? Is there a major downside to them?
I think space delimited tags feel more like typing and you aren't syntactically bound to commas or pipes etc...
I also think that not allowing spaces in tags is kind of nice because you don't have to worry about turning them into URLs they are already ready to go.
Online tools like Delicious use space delimiting for tags, but other tools like Wordpress allow you to have spaces in tags so they require commas. If you do things this way, you will have to create similar tag "slugs" to make sure they work in a URL cleanly (i.e. my tag would be my-tag)
Remember, allowing too much freedom in your tag creation can result in some pretty crazy tags like "Things I like to do on a Saturday afternoon..." etc.
I think the idea is that a tag should be short and to the point.
If you go to type in "favorite recipes", when it doesn't work, you think to yourself "Oh, I should redo this." So you make "favorites" "recipes" instead.
If it was something you really needed, like "pork roast" then it makes sense to make it "pork-roast" but only after you've thought about really joining those works. Perhaps you shouldn't though - perhaps it should be "pork" "roast" so it shows up under searches for pork and searches for roast.
tl;dr It's for user experience and searching so they don't enter something that can't be easily searched for.
Yes, I also think that space-separated tags reduce complexity. And you can enter them faster, because you have the big space key to separate them. Maybe the brain is also less loaded, because it doesn't need to think about where to put a comma delimiter and where not.
I built an app with tags once, and used commas. The only down side to commas is that you have to do more thorough checking of empty tags. For instance in the example:
"George Bush, Bill Clinton, Barack Obama, "
If someone posts the tags with a comma at the end and there is a space it generally will get added to the database.
This is because if you simply check to delete once space you would turn Bill Clinton into BillClinton.
However you can make sure that the tag is at least a certain amount of characters to ensure there is not empty tags. This will not ensure that there is not three or four spaces in a row though.
Anther thing to note is that humans generally put spaces before words and after commas.
So in the example above there is a space after each comma and before each president. This space will be included in the database making the tag:
" Bill Clinton"
instead of
"Bill Clinton"
as the user probably attempted.
Once again you can eliminate both beginning and trailing spaces, but there is more server side code to implement in this case.
If you just use spaces then you can eliminate any unwanted characters like commas etc, and put the tags into an array using space characters to separate them.
I actually prefer tags that are comma or semi-colon delimited for two basic reasons:
It's more natural (user-centric not techie-centric)
It reduces redundancy, as the example noted "favorite-recipes"
"favorite_recipes" "FavoriteRecipes"
And I'm afraid some of the other answers sort of make it sound as if web developers are (how shall I say this) less willing to do the work than, for instance, database developers - who've been addressing some of these same issues for years. (I've coded for web and client-server databases.)
To make matter more specific:
How to detect people names (seems like simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly the information relating to proper names could probably be best found by searching for items that are either grammatically important or significant. I.e. English capitalizes only the first word of the sentence and proper nouns. For the gramatical rules you could look for all of the words that have the first letter of the word capitalized and check it against a database that contains the word and the type [i.e. Bob - Name, Elon - Place, England - Place].
Secondly: Information that is formulaic. This is more about the email addresses, phone numbers, and physical addresses. All of these have a specific formats that don't change. Use a regex and use an algorithm to detect the quality of the matches.
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.