First Name and Last Name Validation vs XSS Attack - validation

My online research suggests that first names and last names should not be heavily validated, to accommodate the variety of names out there. Some people even advocate no validation at all for names. However, the possibility of XSS attacks via the input fields worries me. I checked the Google naming guidelines, and they seem pretty relaxed: they allow Unicode characters as well as things like "%$#^&*....".
So what would be the best approach to take, and how do I balance this out?
PS: I don't intend to spark a debate here. I am genuinely confused and need help understanding the best approach to take.

Validation and XSS are two very different concepts. You cannot balance them. You cannot "sometimes allow XSS". You also do not want to allow input that does not make sense, or that you can't use. If you require an email address for something, you could allow a user to enter "mailme at gmail dot com", but if you do not know how to parse this, then there is no point in allowing it as input in the first place.
When you talk about validating a 'first name field', you ask yourself: "What kind of data do I want to accept in this field, and what kind of data do I not want to accept in this field?". I am not aware of a language where "%" can be part of a first name, so it is probably a safe bet to disallow this character. You have to tackle this problem alone, without even thinking about XSS. If a character, or a sequence of characters, does not make sense as a value for the thing you want to know, you should not include it. If a character does make sense to include, you should not decide otherwise because it has some special meaning.
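As a rough illustration of validating a name field on its own terms, here is a minimal JavaScript sketch. The policy it encodes (letters and combining marks from any script, plus apostrophe, space, dot and hyphen) is an assumption made for the example, not a recommendation, and the names NAME_PATTERN and isPlausibleName are hypothetical.

```javascript
// Assumed policy (hypothetical): a name starts with a letter and may then
// contain letters, combining marks, apostrophes, spaces, dots and hyphens.
// \p{L} and \p{M} need the "u" flag; they match letters/marks of any script.
const NAME_PATTERN = /^[\p{L}\p{M}][\p{L}\p{M}' .-]*$/u;

// Validation only answers "is this the kind of data we want?".
// It deliberately knows nothing about HTML or XSS.
function isPlausibleName(name) {
  return NAME_PATTERN.test(name.trim());
}

console.log(isPlausibleName("Anne-Marie O'Neil"));
console.log(isPlausibleName("100%"));
```

Whatever this function accepts still has to be escaped when it is written back into a page; a stricter or looser pattern does not change that.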
XSS is the problem where incorrectly escaped (user) input is returned to the browser, allowing a possible attacker to load/run third-party scripts. It has nothing to do with validation. If the character "a" is potentially unsafe, would you disallow it from the first name field? The solution can be found in the definition: The problem exists if, and only if, the user input is incorrectly escaped.
Think about how you are going to send this data back to the user. I take an input field as an example: <input value="" />. If you were going to put the data in a textarea instead, you would need to alter it differently. Inserting it between <div> and </div> would require something different again, and inserting it inside a script between <script></script> tags would require something different from all of the above. There is no one-size-fits-all solution.
For the input field example, find out which characters have a special meaning in this context. The delimiter of the value attribute (the " in value="") is one such character. If there are any other special characters, you will find them in the accompanying documentation. You have to escape such characters. Escaping is the act of removing the special meaning from a character; how to do that is also described in the documentation. In the case of an input element in HTML, you need to turn the special character into its entity form (" becomes &quot;). PHP provides built-in functions to do this, but you should always be wary of what such a function actually does and whether it gives the desired output for every use case.
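To make the "different context, different escaping" point concrete, here is a minimal JavaScript sketch with one helper per context. The function names are hypothetical, and the character lists cover only text between tags and quoted attribute values; script and URL contexts need their own rules.

```javascript
// Hypothetical helper: escape text that will sit between tags,
// e.g. <div>HERE</div>. "&" goes first so later entities stay intact.
function escapeForHtmlText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: escape text for a quoted attribute value,
// e.g. <input value="HERE" />. The quote characters lose their meaning too.
function escapeForHtmlAttribute(text) {
  return escapeForHtmlText(text)
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const firstName = 'Anne "Annie" <3';
console.log('<input value="' + escapeForHtmlAttribute(firstName) + '" />');
console.log('<div>' + escapeForHtmlText(firstName) + '</div>');
```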
tl;dr There is no balance. You use validation on a field to get the data you actually want. If you want to present this data to the user, you have to escape the data for the special case where you want to display this data.
Example: Let's look at the following case. We have a textarea. We allow the characters a-z, <, >, (, ), {, }, / and ; in any order. If the textarea contains other characters we consider it invalid. If the textarea is valid, we put the characters in the textarea between <div> and </div> in the html document.
From the definition above, you can derive that asdf is a valid input and that <scri (random nonsense to bypass faulty proxy) pt>alert();</sc (more random nonsense) ript> is also a valid input, but 123 isn't. That is the definition. The logic that handles validation should flawlessly discriminate between valid and invalid input. You probably notice that the second valid input may pose a problem, but that is of no concern to the validate function. The validate function only checks whether the text matches the description of what we consider valid input.
If the text in the textarea is valid, the definition says we should put it between the div tags. This is where you start worrying about XSS. Some characters, namely < and >, have a special meaning in HTML. Because they are valid input, we should remove their special meaning when we insert them into the HTML. If the textarea is invalid, we can't do anything with it. We would display a descriptive error message explaining how the input should be corrected.
The pseudo-implementation below shows what I try to explain above. In a real-life application that communicates with the server, the server should do validation too, but it should show how both concepts are separated and should allow you to test things.
$('#billy').on('click', function(e) {
  if (validate($('#txt').val())) {
    $('#status').text("The textarea is valid. The contents have been inserted as html in the page.");
    // Unsafe on purpose: valid input is inserted without escaping.
    $('#result').html($('#txt').val());
  } else {
    $('#status').text("The textarea is INVALID. It contains characters we don't want.");
  }
});
$('#betty').on('click', function(e) {
  if (validate($('#txt').val())) {
    $('#status').text("The textarea is valid. The contents have been inserted as html in the page.");
    // Same valid input, but escaped for the html context first.
    $('#result').html(escapeforhtml($('#txt').val()));
  } else {
    $('#status').text("The textarea is INVALID. It contains characters we don't want.");
  }
});
// Returns true only if txt consists solely of the allowed characters.
function validate(txt) {
  return /^[a-z{}\/()<>;]*$/.test(txt);
}
// We know only a limited set of characters can be inserted.
// Of those, < and > are the only characters that have a special
// meaning in html.
function escapeforhtml(txt) {
  return txt.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<div id="status"></div>
<textarea id="txt" cols="60" rows="10"></textarea>
<br/>
<input type="button" id="billy" value="Do something">
<input type="button" id="betty" value="Do something while ESCAPING">
<p>Result:</p>
<div id="result"></div>

Related

Issues with parameters containing whitespace

I have the following situation with FreeMarker.
When returning to an .ftl page, I send a parameter from Java in the URL, similar to "AAA% BBB#DDD.COM". In Java the value is fine.
In the URL, however, it appears as "AAA%25+BBB#DDD.COM", and with the following code:
<#if myCase??>
value = ${user}
</#if>
It writes "AAA%" into my HTML field, but not the rest.
How can I solve this issue?
Thanks in advance.
EDIT: After further investigation, I see that the code above writes this to the HTML:
value="AAA%" BBB#CCC.com=""
EDIT2: Let me give you some more information. First, here's the relevant Java code:
Map mapping = new HashMap();
if(user != null && !user.isEmpty()){
mapping.put("user",user); //EG: AAA% BBB#DDD.COM (Checked in debug)
}
I have a URL similar to mysite.xx?user=AAA%25+BBB#DDD.COM, so the user is attached as a query parameter of the URL.
I need to reuse the "user" parameter to repopulate the form field for the username. I know this is not a valid email address, but an alias system already installed by the customer works this way.
What could be the cause of the problem
Given your template:
<#if myCase??>
value = ${user}
</#if>
FreeMarker, in HTML output mode, writes the following:
value = AAA% BBB#DDD.COM
FreeMarker does not know that (in your context) the value of user is meant to be an attribute value. It treats the contents of the string user as part of the HTML itself (which could be anything: a complete input field, single tags, etc.) and simply pastes the contents of the model at the position in your template where you placed the interpolation ${user}.
The FreeMarker result is not a valid HTML attribute-value pair. An attribute name must follow some naming conventions (e.g. no special characters), and when the attribute has a value, the name is followed by an equals sign and then the value enclosed in double quotes (quotes are required as soon as the value contains whitespace).
So most browsers turn your result into valid HTML attributes, actually two of them: value="AAA%" and BBB#CCC.com="". If you open the output HTML in Firefox, you will see this in the Inspector (NOT in the raw source view):
<input type="text" value="AAA%" bbb#ddd.com="">
What is not the cause
FreeMarker's auto-escaping (especially in HTML output mode) when it writes the final HTML.
@ddekany Thanks for your comment! It made me reproduce the problem and discover the real cause.
URL encoding/decoding
In Java you could also URL-encode the string variable user. That converts "% " (i.e. a percent sign followed by a space) into %25+, which is valid inside a URL.
Run this Java snippet online on IDEONE to see the effects of URL encoding and URL decoding.
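The encode/decode round trip can also be tried without Java. The following is a minimal JavaScript sketch using the standard URLSearchParams and encodeURIComponent APIs. It is not the IDEONE snippet referred to above, and its output is not guaranteed to match Java's encoder character for character.

```javascript
// The raw value from the question, as it exists on the server side.
const user = "AAA% BBB#DDD.COM";

// Form-style encoding: "%" becomes %25, the space becomes "+",
// and "#" becomes %23 so it cannot start a URL fragment.
const query = new URLSearchParams({ user }).toString();
console.log(query); // user=AAA%25+BBB%23DDD.COM

// Decoding the query string gives back the original value.
const decoded = new URLSearchParams(query).get("user");
console.log(decoded === user); // true

// encodeURIComponent is the non-form variant: the space becomes %20.
console.log(encodeURIComponent(user)); // AAA%25%20BBB%23DDD.COM
```

URL encoding only makes the value safe inside a URL. Writing the decoded value into an HTML attribute is a separate step with its own escaping.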
Solutions
Use either of these solutions to get desired output by fixing the HTML-attribute value-assignment in your template:
(1) use double-quotes:
<#if myCase??>
value="${user}"
</#if>
(2) use some built-ins to transform the plain string-output:
Use some of FreeMarker's built-ins for strings. In your case you could append ?url to the variable-name and use double-quotes around your variable-interpolation within your template, e.g.:
<#if myCase??>
href="mailto:${user?url}"
</#if>
Caution: validate a URL or email address (even parts of it) as early as possible
BBB#DDD.COM is a valid email address, but % and whitespace are not allowed inside an email address.
On the other hand, # is typically not part of a URL, except inside a query-parameter value. But your user value does not start with http:// etc.
So depending on the use case of user, the value AAA% BBB#DDD.COM could end up as part of a URL or as an email address.
In your particular case, you said:
populate the form field relative to the username. Model-variable user does not contain a valid email-address. It is used in conjunction with an alias system already installed by the customer. So aliasing will work this way.
Let's suppose the end user who later edits the form field is responsible for making it valid (or a script does this validation).
Anyway, bear in mind that an internet address (such as a URL or email address) needs some validation:
either before written to the final HTML (using Java or Freemarker)
or after being further processed inside your web-page (using JavaScript).
Otherwise it could possibly not yield the desired effect.
See also
Related questions:
Is there any way to url decode variable on Freemarker?
Java URL encoding of query string parameters

Processing form input in a Joomla component

I am creating a Joomla component and one of the pages contains a form with a text input for an email address.
When a < character is typed in the input field, that character and everything after is not showing up in the input.
I tried $_POST['field'] and JFactory::getApplication()->input->getCmd('field')
I also tried alternatives for getCmd like getVar, getString, etc. but no success.
E.g. John Doe <j.doe#mail.com> returns only John Doe.
When the < is left out, like John Doe j.doe#mail.com> the value is coming in correctly.
What can I do to also have the < character in the posted variable?
BTW. I had to use & lt; in this question to display it as I want it. This form suffers from the same problem!!
You actually need to set the filtering that you want when you grab the input. Otherwise, you will get some heavy filtering. (Typically, I will also lose # symbols.)
Replace this line:
JFactory::getApplication()->input->getCmd('field');
with this line:
JFactory::getApplication()->input->getRaw('field');
The name after the get part of the function is the filtering that will be used. Cmd strips everything but alphanumeric characters and ., -, and _. String runs the value through Joomla's HTML tag cleaning and, depending on your settings, will strip out <>. (That usually doesn't happen for me, but my settings are generally pretty open, to the point of no filtering on super admins and such.)
getRaw should definitely work, but note that there is no filtering at all, which can open security holes in your application.
The default text filter trims html from the input for your field. You should set the property
filter="raw"
in your form's manifest (xml) file, and then use getRaw() to retrieve the value. getCmd removes the non-alphanumeric characters.

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。', where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How can I check for the presence of the string as well as the number, which can change depending on the test that's running?
I should have added that I'm checking these strings on a web page, which is displayed in the relevant language depending on where it is being viewed from. I'm using Watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, keyed by a variable with a name like 'language', which lets the tests change the language on the fly. Validations would then look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but falls within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially Unicode regular expressions, as well as some experimenting with a tool like Rubular to test regexes.
In the case above a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above and expect one or more digits in the middle. (Note that it would fail if no digits appear; if you want zero or more digits, you would need a * in place of the +.)
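The per-language lookup and the regex match can be combined so that a test checks the fixed text and reads the number in one step. This is a minimal sketch written in JavaScript rather than Ruby/Watir; WAIT_TIME_PATTERN and extractWaitMinutes are hypothetical names. The patterns themselves carry over to Ruby, where \A and \z are the whole-string anchors.

```javascript
// One pattern per language, keyed the same way as the constants hash.
// The capture group holds the part that changes between test runs.
const WAIT_TIME_PATTERN = {
  en: /^Waiting time is (\d+) minutes\.$/,
  ja: /^待ち時間は(\d+)分です。$/,
};

// Returns the number of minutes shown in the message,
// or null if the text does not match the expected wording.
function extractWaitMinutes(text, language) {
  const match = WAIT_TIME_PATTERN[language].exec(text);
  return match ? Number(match[1]) : null;
}

console.log(extractWaitMinutes("Waiting time is 12 minutes.", "en")); // 12
console.log(extractWaitMinutes("待ち時間は5分です。", "ja")); // 5
console.log(extractWaitMinutes("Waiting time is minutes.", "en")); // null
```

A test can then assert both that the result is not null and that the number equals the value expected for that run.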
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.

How to define a custom escape sequence for a string, and then parse it?

I'm writing a simple Ruby on Rails app. I have a model with a "description" attribute, which is a string.
I'd like to display this string in a view, but have some of the words in the string rendered using a special music font (one of the ones located here), and the rest to use the main font of the website. Problem is, since the description attribute is just a string that is persisted to the database, there's no real way to tell which words should use the special font...
The only way I can think of would be to define my own "escape sequence" or "special characters" that would allow me to indicate to the view whether a word should use the special font.
For example, say I have the following string:
cat dog rabbit elephant
If I wanted "dog" and "elephant" to use the special font, I could then store the string in the database as:
cat ${dog} rabbit ${elephant}
In other words, use "${}" as the custom escape sequence.
And then in my view I would have a helper method to process the string, and generate the appropriate HTML/CSS for words that use the escape sequence. For example, this is the kind of output I would expect it to produce:
<p>cat <span class="music">dog</span> rabbit <span class="music">elephant</span></p>
Does this seem like a reasonable solution? If so, how would I implement the view method to parse the string and the escape characters? I'm guessing some sort of regular expression?
In a way, it's sort of similar to how LaTeX allows you to render mathematical equations. For instance, in LaTeX, to activate the mathematical font for particular characters, you can do:
\mathnormal{some text}
You could just store the html snippet directly in your database.
<p>cat <span class="music">dog</span> rabbit <span class="music">elephant</span></p>
Otherwise, you would have to remake the HTML string every time it was requested.
If that's not an option, you could use a simple regex to replace the ${} with an html snippet.
string = "cat ${dog} rabbit ${elephant}"
string.gsub(/\$\{([^\}]+)\}/, '<span class="music">\1</span>')
=> "cat <span class=\"music\">dog</span> rabbit <span class=\"music\">elephant</span>"
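If the description comes from users, the stored text should be HTML-escaped before the ${} markers are expanded, otherwise the generated markup becomes an injection point. Here is a minimal sketch of that order of operations, written in JavaScript rather than Ruby; escapeHtml and renderMusicMarkup are hypothetical names.

```javascript
// Hypothetical helper: neutralise characters that are special in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Escape first, then turn each ${word} marker into a span.
// Escaping does not touch "$", "{" or "}", so the markers survive it.
function renderMusicMarkup(description) {
  return escapeHtml(description).replace(
    /\$\{([^}]+)\}/g,
    '<span class="music">$1</span>'
  );
}

console.log(renderMusicMarkup("cat ${dog} rabbit ${elephant}"));
// cat <span class="music">dog</span> rabbit <span class="music">elephant</span>
```

Doing the two steps in the opposite order would escape the generated span tags as well.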

Bug in Chrome, or Stupidity in User? Sanitising inputs on forms?

I've written a more detailed post about this on my blog at:
http://idisposable.co.uk/2010/07/chrome-are-you-sanitising-my-inputs-without-my-permission/
but basically, I have a string which is:
||abcdefg
hijklmn
opqrstu
vwxyz
||
The pipes are added to indicate where the string starts and ends; in particular, note the final carriage return on the last line.
I need to put this into a hidden form variable to post off to a supplier.
In basically any browser except Chrome, I get the following:
<input type="hidden" id="pareqMsg" value="abcdefg
hijklmn
opqrstu
vwxyz
" />
but in chrome, it seems to apply a .Trim() or something else that gives me:
<input type="hidden" id="pareqMsg" value="abcdefg
hijklmn
opqrstu
vwxyz" />
Notice it's cut off the last carriage return. These carriage returns (when Encoded) come up as %0A if that helps.
Basically, in any browser except chrome, the whole thing just works and I get the desired response from the third party. In Chrome, I get an 'invalid pareq' message (which suggests to me that those last carriage returns are important to the supplier).
Chrome version is 5.0.375.99
Am I going mad, or is this a bug?
Cheers,
Terry
You can't rely on form submission to preserve the exact character data you include in the value of a hidden field. I've had issues in the past with Firefox converting CRLF (\r\n) sequences into bare LFs, and your experience shows that Chrome's behaviour is similarly confusing.
And it turns out, it's not really a bug.
Remember that what you're supplying here is an HTML attribute value - strictly, the HTML 4 DTD defines the value attribute of the <input> element as of type CDATA. The HTML spec has this to say about CDATA attribute values:
User agents should interpret attribute values as follows:
Replace character entities with characters,
Ignore line feeds,
Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA attribute values (e.g., " myval " may be interpreted as "myval"). Authors should not declare attribute values with leading or trailing white space.
So whitespace within the attribute value is subject to a number of user agent transformations - conforming browsers should apparently be discarding all your linefeeds, not only the trailing one - so Chrome's behaviour is indeed buggy, but in the opposite direction to the one you want.
However, note that the browser is also expected to replace character entities with characters - which suggests you ought to be able to encode your CRs and LFs as &#13; and &#10;, and even spaces as &#32;, eliminating any actual whitespace characters from your value field altogether.
However, browser compliance with these SGML parsing rules is, as you've found, patchy, so your mileage may certainly vary.
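The idea of writing CRs and LFs as numeric character references (&#13; and &#10;) can be sketched as a small server-side helper. This is a minimal JavaScript sketch with a hypothetical function name; whether a given browser then posts the trailing line break back unchanged still has to be checked against the supplier.

```javascript
// Hypothetical helper: prepare a value for a double-quoted HTML attribute
// so that it contains no literal CR or LF characters.
function encodeAttributeWhitespace(value) {
  return value
    .replace(/&/g, "&amp;")   // first, so the references added below survive
    .replace(/"/g, "&quot;")  // keep the attribute delimiter intact
    .replace(/\r/g, "&#13;")  // carriage return
    .replace(/\n/g, "&#10;"); // line feed
}

const pareq = "abcdefg\nhijklmn\nopqrstu\nvwxyz\n";
console.log(
  '<input type="hidden" id="pareqMsg" value="' +
    encodeAttributeWhitespace(pareq) +
    '" />'
);
```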
Confirmed it here. It trims trailing CRLFs, they don't get parsed into the browser's DOM (I assume for all HTML attributes).
If you append CRLF with script, e.g.
var pareqMsg = document.forms[0]['pareqMsg']
if (/\r\n$/.test(pareqMsg.value) == false)
pareqMsg.value += '\r\n';
...they do get maintained and POSTed back to the server. Although the hidden <textarea> idea suggested by Gaby might be easier!
Normally you cannot enter a newline in an input box (by keyboard), so perhaps Chrome enforces this even for values embedded through attributes.
Try using a textarea (with display:none).

Resources