PHP mail form security: most reliable way to spot newlines with preg_match

I am trying to build a secure PHP contact form to allow users (and hopefully not spammers) to send mail.
I am looking at ways of detecting newlines in the From: field, in which users will submit their email address, and in the Subject: field.
I have two alternatives of the same function to detect newlines and I would like your opinion about which one would be the most reliable (meaning that it works in the most cases):
function containingnewlines1($stringtotest) {
if (preg_match("/(%0A|%0D|\\n+|\\r+)/i", $stringtotest) != 0) {
echo "Newline found. Suspected injection attempt";
exit;
}
}
function containingnewlines2($stringtotest) {
if (preg_match("/^\R$/", $stringtotest) != 0) {
echo "Newline found. Suspected injection attempt";
exit;
}
}
Thank you in advance for your opinions!
Cheers

The vastly more pertinent question is "Which one is more reliable?". The efficiency of either approach is irrelevant because neither approach should take more than a few milliseconds to execute. Trying to decide between the two based on a matter of milliseconds is a micro-optimization.
Furthermore, what do you mean by efficiency? Do you mean which one is faster? Which one consumes the least memory? Efficiency is an ill-defined term, you need to be more specific.
If you absolutely must make a decision based on performance/efficiency requirements then I'd recommend constructing a benchmark and finding out for yourself which one is the closest fit to your requirements, because at the end of the day only you can answer that question.

I added two more functions myself and ran a benchmark of 100,000 iterations:
function containingnewlines3($stringtotest) {
return (strpbrk($stringtotest,"\r\n") !== FALSE);
}
function containingnewlines4($stringtotest) {
return (strpos($stringtotest,"\n") !== FALSE || strpos($stringtotest,"\r") !== FALSE);
}
$start = microtime(TRUE);
for($x=0;$x<100000;$x++) {
containingnewlines1($html); // 0.272623 ms
containingnewlines2($html); // 0.244299 ms
containingnewlines3($html); // 0.377767 ms
containingnewlines4($html); // 0.142282 ms
}
echo (microtime(TRUE) - $start);
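For comparison, here is a minimal sketch of timing each candidate on its own rather than inside one shared loop. It is written in Python rather than PHP, and the function names and the sample string are made up for the example. Note that the two candidates are not equivalent: only the regex variant also looks for the URL-encoded forms.

```python
import re
import timeit

PATTERN = re.compile(r"%0A|%0D|\r|\n", re.IGNORECASE)


def with_regex(value: str) -> bool:
    """Regex check: raw CR/LF or their URL-encoded forms."""
    return PATTERN.search(value) is not None


def with_substring(value: str) -> bool:
    """Plain substring check: raw CR/LF only."""
    return "\r" in value or "\n" in value


# Hypothetical input: a long subject with an injected header at the end.
sample = "subject line " * 50 + "\r\nBcc: spam@example.com"

# Time each function separately so the numbers are not mixed together.
for func in (with_regex, with_substring):
    seconds = timeit.timeit(lambda: func(sample), number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10000 calls")
```

The absolute numbers depend on the machine and the input, so they are only useful for comparing the candidates against each other.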

Actually, I decided to use the first function, as it covers two more cases (%0A and %0D) and also includes all the newline character variations used by different OSes (\n, \r\n, etc.).
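To illustrate why the two patterns in the question behave differently, here is a rough sketch in Python rather than PHP (the function names are invented for this example). It assumes the goal is to reject any header value that contains a raw CR or LF, or their URL-encoded forms. Python's re module has no \R, so the anchored variant spells the line breaks out explicitly.

```python
import re

# Unanchored: matches a raw CR/LF or a URL-encoded one anywhere in the value.
ANYWHERE = re.compile(r"%0A|%0D|\r|\n", re.IGNORECASE)

# Anchored, in the spirit of /^\R$/: the whole value must be one line break.
ONLY_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def contains_newline(value: str) -> bool:
    """Return True if value could smuggle an extra mail header."""
    return ANYWHERE.search(value) is not None


def is_only_newline(value: str) -> bool:
    """Return True only if value is nothing but a single line break."""
    return ONLY_LINE_BREAK.fullmatch(value) is not None


attack = "victim@example.com\r\nBcc: spam@example.com"
print(contains_newline(attack))  # True: the injected header is detected
print(is_only_newline(attack))   # False: the anchored pattern misses it
```

An anchored pattern only fires when the entire string is a line break, which is why it would let a typical injection attempt through.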


SCAN command with spring redis template

I am trying to execute "scan" command with RedisConnection. I don't understand why the following code is throwing NoSuchElementException
RedisConnection redisConnection = redisTemplate.getConnectionFactory().getConnection();
Cursor c = redisConnection.scan(scanOptions);
while (c.hasNext()) {
c.next();
}
Exception:
java.util.NoSuchElementException
at java.util.Collections$EmptyIterator.next(Collections.java:4189)
at org.springframework.data.redis.core.ScanCursor.moveNext(ScanCursor.java:215)
at org.springframework.data.redis.core.ScanCursor.next(ScanCursor.java:202)
Yes, I have tried this with spring-data-redis 1.6.6.RELEASE. There were no issues; the simple while loop below is enough. I have also set the count value to 100 (higher than the default) to save round-trip time.
RedisConnection redisConnection = null;
try {
redisConnection = redisTemplate.getConnectionFactory().getConnection();
ScanOptions options = ScanOptions.scanOptions().match(workQKey).count(100).build();
Cursor<byte[]> c = redisConnection.scan(options);
while (c.hasNext()) {
logger.info(new String(c.next()));
}
} finally {
// Guard against getConnection() having failed before the assignment.
if (redisConnection != null) {
redisConnection.close(); // Ensure closing this connection.
}
}
I'm using spring-data-redis 1.6.0-RELEASE and Jedis 2.7.2. I do think that the ScanCursor implementation in this version is slightly flawed with regard to handling this case, though I've not checked previous versions.
So: it is rather complicated to explain, but the ScanOptions object has a "count" field that needs to be set (the default is 10). This field holds an "intent" or "expected" number of results for the search. As explained (not really clearly, IMHO) here, you may change the value of count at each invocation, especially if no result has been returned. I understand this as a "work intent": if you do not get anything back, maybe your key space is vast and the SCAN command has not worked "hard enough". Obviously, as long as you're getting results back, you do not need to increase it.
A "simple-but-dangerous" approach would be to use a very large count (e.g. 1 million or more). This makes Redis go off and search your vast key space to find "at least or near as much" as your large count. Don't forget that Redis is single-threaded, so you have just killed your performance. Try this against a Redis instance with 12M keys and you'll see that although SCAN may happily return results with a very high count value, the server will do absolutely nothing else for the duration of that search.
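The behaviour described above, where a single SCAN call can come back empty even though matching keys remain, can be sketched with a toy in-memory model. This is Python, not Spring Data Redis, and fake_scan is a hypothetical stand-in for the server. It only mimics the documented contract that the iteration is complete when the returned cursor is 0, not when a page happens to be empty. A real Redis cursor is an opaque value, not a list index.

```python
import fnmatch

# Made-up key space: many non-matching keys followed by two matching ones.
KEYS = [f"user:{i}" for i in range(95)] + ["work:1", "work:2"]


def fake_scan(cursor: int, match: str, count: int):
    """Hypothetical stand-in for SCAN: inspect `count` keys, return the matches."""
    window = KEYS[cursor:cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(KEYS):
        next_cursor = 0  # 0 signals that the full iteration is complete
    return next_cursor, [k for k in window if fnmatch.fnmatchcase(k, match)]


def scan_all(match: str, count: int = 10):
    """Keep calling until the cursor comes back as 0, tolerating empty pages."""
    found, cursor, calls = [], 0, 0
    while True:
        cursor, page = fake_scan(cursor, match, count)
        calls += 1
        found.extend(page)
        if cursor == 0:
            return found, calls


print(scan_all("work:*", count=10))   # nine empty pages before the matches
print(scan_all("work:*", count=100))  # one round trip, but one long call
```

A larger count trades round trips for longer individual calls, which is the trade-off the answer warns about.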
To the solution to your problem:
ScanOptions options = ScanOptions.scanOptions().match(pattern).count(countValue).build();
boolean done = false;
// the while-loop below makes sure that we'll get a valid cursor -
// by looking harder if we don't get a result initially
while (!done) {
try (Cursor<byte[]> c = redisConnection.scan(options)) {
while (c.hasNext()) {
c.next();
}
done = true; // we've made it here, let's go away
} catch (NoSuchElementException nse) {
System.out.println("Going for " + countValue + " was not hard enough. Trying harder");
countValue *= 2; // double the hint so that each retry really does look harder
options = ScanOptions.scanOptions().match(pattern).count(countValue).build();
}
}
Do note that the ScanCursor implementation of Spring Data Redis will properly follow the SCAN instructions and loop as many times as needed to reach the end of the iteration, as per the documentation. I've not found a way to change the scan options within the same cursor, so there is a risk that if you get halfway through your results and hit a NoSuchElementException, you'll start again (and essentially do some of the work twice).
Of course, better solutions are always welcome :)
My old code
ScanOptions.scanOptions().match("*" + query + "*").count(10).build();
Working code
ScanOptions.scanOptions().match("*" + query + "*").count(Integer.MAX_VALUE).build();

Sorting CouchDB Views By Value

I'm testing out CouchDB to see how it could handle logging some search results. What I'd like to do is produce a view where I can produce the top queries from the results. At the moment I have something like this:
Example document portion
{
"query": "+dangerous +dogs",
"hits": "123"
}
Map function
(Not exactly what I need/want but it's good enough for testing)
function(doc) {
if (doc.query) {
var split = doc.query.split(" ");
for (var i in split) {
emit(split[i], 1);
}
}
}
Reduce Function
function (key, values, rereduce) {
return sum(values);
}
Now this will get me results in a format where a query term is the key and the count for that term on the right, which is great. But I'd like it ordered by the value, not the key. From the sounds of it, this is not yet possible with CouchDB.
So does anyone have any ideas of how I can get a view where I have an ordered version of the query terms & their related counts? I'm very new to CouchDB and I just can't think of how I'd write the functions needed.
It is true that there is no dead-simple answer. There are several patterns however.
http://wiki.apache.org/couchdb/View_Snippets#Retrieve_the_top_N_tags. I do not personally like this because they acknowledge that it is a brittle solution, and the code is not relaxing-looking.
Avi's answer, which is to sort in-memory in your application.
couchdb-lucene which it seems everybody finds themselves needing eventually!
What I like is what Chris said in Avi's quote. Relax. In CouchDB, databases are lightweight and excel at giving you a unique perspective of your data. These days, the buzz is all about filtered replication which is all about slicing out subsets of your data to put in a separate DB.
Anyway, the basics are simple. You take your .rows from the view output and you insert it into a separate DB which simply emits keyed on the count. An additional trick is to write a very simple _list function. Lists "render" the raw couch output into different formats. Your _list function should output
{ "docs":
[ {..view row1...},
{..view row2...},
{..etc...}
]
}
What that will do is format the view output exactly the way the _bulk_docs API requires it. Now you can pipe curl directly into another curl:
curl host:5984/db/_design/myapp/_list/bulkdocs_formatter/query_popularity \
| curl -X POST -H "Content-Type: application/json" -d @- host:5984/popularity_sorter/_bulk_docs
In fact, if your list function can handle all the docs, you may just have it sort them itself and return them to the client sorted.
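As a rough sketch of the two ideas above (wrapping view rows in the envelope that _bulk_docs expects, and sorting the rows by value), here is a small Python example. The sample rows and the term/count field names are made up for illustration; a real _list function would be written in JavaScript inside the design document.

```python
import json

# Made-up rows in the shape a grouped reduce view returns.
rows = [
    {"key": "+dangerous", "value": 7},
    {"key": "+dogs", "value": 12},
    {"key": "+cats", "value": 3},
]


def to_bulk_docs(view_rows):
    """Wrap view rows in the {"docs": [...]} envelope that _bulk_docs accepts."""
    return {"docs": [{"term": r["key"], "count": r["value"]} for r in view_rows]}


def top_terms(view_rows, limit=None):
    """Sort rows by value, highest first, optionally keeping only `limit` rows."""
    ordered = sorted(view_rows, key=lambda r: r["value"], reverse=True)
    return ordered[:limit]


print(json.dumps(to_bulk_docs(rows)))
print([r["key"] for r in top_terms(rows, limit=2)])  # ['+dogs', '+dangerous']
```

In the second database, a view that emits the count as its key would then return these documents already ordered by popularity.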
This came up on the CouchDB-user mailing list, and Chris Anderson, one of the primary developers, wrote:
This is a common request, but not supported directly by CouchDB's
views -- to do this you'll need to copy the group-reduce query to
another database, and build a view to sort by value.
This is a tradeoff we make in favor of dynamic range queries and
incremental indexes.
I needed to do this recently as well, and I ended up doing it in my app tier. This is easy to do in JavaScript:
db.view('mydesigndoc', 'myview', {'group':true}, function(err, data) {
if (err) throw new Error(JSON.stringify(err));
data.rows.sort(function(a, b) {
return a.value - b.value;
});
data.rows.reverse(); // optional, depending on your needs
// do something with the data…
});
This example runs in Node.js and uses node-couchdb, but it could easily be adapted to run in a browser or another JavaScript environment. And of course the concept is portable to any programming language/environment.
HTH!
This is an old question but I feel it still deserves a decent answer (I spent at least 20 minutes searching for the correct answer...)
I disapprove of the other suggestions in the answers here and find them unsatisfactory. In particular, I don't like the suggestion to sort the rows in the application layer, as it doesn't scale well and doesn't handle the case where you need to limit the result set in the DB.
The better approach that I came across is suggested in this thread: if you need to sort by the values, add them to the key and then query the key using a range, specifying the desired key and leaving the value range open. For example, if your key is composed of country, state and city:
emit([doc.address.country,doc.address.state, doc.address.city], doc);
Then you query just the country and get free sorting on the rest of the key components:
startkey=["US"]&endkey=["US",{}]
In case you also need to reverse the order, note that simply setting descending=true will not suffice. You also need to swap the start and end keys, i.e.:
startkey=["US",{}]&endkey=["US"]&descending=true
See more reference at this great source.
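The range query on a composite key can be approximated in a few lines of Python. This is only a sketch: the keys are made up, Python's list comparison stands in for CouchDB's collation rules (which differ in detail), and a very high code point stands in for the {} end marker.

```python
# Made-up keys in the shape emitted by the map function: [country, state, city]
index = sorted([
    ["CA", "ON", "Toronto"],
    ["US", "CA", "Los Angeles"],
    ["US", "NY", "Albany"],
    ["US", "CA", "Fresno"],
    ["UY", "MO", "Montevideo"],
])

# Assumption: a very high code point plays the role of CouchDB's {} end marker.
HIGH = "\uffff"


def key_range(keys, startkey, endkey, descending=False):
    """Return the keys between startkey and endkey, mimicking a view range query."""
    if descending:
        # In descending mode the caller passes the high key first, so swap back.
        startkey, endkey = endkey, startkey
    hits = [k for k in keys if startkey <= k <= endkey]
    return hits[::-1] if descending else hits


print(key_range(index, ["US"], ["US", HIGH]))
print(key_range(index, ["US", HIGH], ["US"], descending=True))
```

Within the selected country the rows come back ordered by state and then city, which is the "free sorting" on the remaining key components.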
I'm unsure about the 1 you have as your returned result, but I'm positive this should do the trick:
emit([doc.hits, split[i]], 1);
The rules of sorting are defined in the docs.
Based on Avi's answer, I came up with this Couchdb list function that worked for my needs, which is simply a report of most-popular events (key=event name, value=attendees).
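ddoc.lists.eventPopularity = function(req, res) {
start({ headers : { "Content-type" : "text/plain" } });
var data = [], row, i;
// Collect every view row so they can be sorted in memory.
while ((row = getRow())) {
data.push(row);
}
// Sort ascending by value, then reverse to get the most popular first.
data.sort(function(a, b) {
return a.value - b.value;
}).reverse();
for (i = 0; i < data.length; i++) {
send(data[i].value + ': ' + data[i].key + "\n");
}
}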
For reference, here's the corresponding view function:
ddoc.views.eventPopularity = {
map : function(doc) {
if(doc.type == 'user') {
for(i in doc.events) {
emit(doc.events[i].event_name, 1);
}
}
},
reduce : '_count'
}
And the output of the list function (snipped):
165: Design-Driven Innovation: How Designers Facilitate the Dialog
165: Are Your Customers a Crowd or a Community?
164: Social Media Mythbusters
163: Don't Be Afraid Of Creativity! Anything Can Happen
159: Do Agencies Need to Think Like Software Companies?
158: Customer Experience: Future Trends & Insights
156: The Accidental Writer: Great Web Copy for Everyone
155: Why Everything is Amazing But Nobody is Happy
I think every solution above will hurt CouchDB performance, though I am very new to this database. As far as I know, CouchDB views prepare their results before they are queried, so it seems we need to prepare these results manually. For example, each search term would be stored in the database along with its hit count. When somebody searches, the search terms are looked up and their hit counts incremented. When we want to see search term popularity, the view emits a (hitcount, searchterm) pair.
The link Retrieve_the_top_N_tags seems to be broken, but I found another solution here.
Quoting the dev who wrote that solution:
rather than returning the results keyed by the tag in the map step, I would emit every occurrence of every tag instead. Then in the reduce step, I would calculate the aggregation values grouped by tag using a hash, transform it into an array, sort it, and choose the top 3.
As stated in the comments, the only problem would be in case of a long tail:
Problem is that you have to be careful with the number of tags you obtain; if the result is bigger than 500 bytes, you'll have couchdb complaining about it, since "reduce has to effectively reduce". 3 or 6 or even 20 tags shouldn't be a problem, though.
It worked perfectly for me; check the link to see the code!
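The aggregation described in the quote (count every occurrence per tag, sort, keep the top few) looks roughly like this in Python. It is only a sketch of the idea with made-up tags; the linked solution does this inside a CouchDB reduce function in JavaScript.

```python
from collections import Counter

# Made-up tag occurrences, one entry per emitted tag.
emitted = [
    "couchdb", "redis", "couchdb", "php", "redis",
    "couchdb", "java", "php", "redis", "couchdb",
]


def top_tags(tags, n=3):
    """Count occurrences per tag and keep only the n most frequent."""
    return Counter(tags).most_common(n)


print(top_tags(emitted))  # [('couchdb', 4), ('redis', 3), ('php', 2)]
```

Keeping n small is what keeps the reduce output small, which is the concern raised in the comment about a long tail.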

How much information hiding is necessary when doing code refactoring?

How much information hiding is necessary? I have boilerplate code before I delete a record, it looks like this:
public override void OrderProcessing_Delete(Dictionary<string, object> pkColumns)
{
var c = Connect();
using (var cmd = new NpgsqlCommand("SELECT COUNT(*) FROM orders WHERE order_id = :_order_id", c)
{ Parameters = { {"_order_id", pkColumns["order_id"]} } } )
{
var count = (long)cmd.ExecuteScalar();
// deletion's boilerplate code...
if (count == 0) throw new RecordNotFoundException();
else if (count > 1) throw new DatabaseStructureChangedException();
// ...boiler plate code
}
// deleting of table(s) goes here...
}
NOTE: boilerplate code is code-generated, including the "using (var cmd = new NpgsqlCommand( ... )"
But I'm seriously thinking of refactoring the boilerplate code, as I want more succinct code. This is how I envision the refactored code (made nicer with an extension method, though that is not the sole reason ;)):
using (var cmd = new NpgsqlCommand("SELECT COUNT(*) FROM orders WHERE order_id = :_order_id", c)
{ Parameters = { {"_order_id", pkColumns["order_id"]} } } )
{
cmd.VerifyDeletion(); // [EDIT: was ExecuteWithVerification before]
}
I want the ExecuteScalar call and the boilerplate code to go inside the extension method.
For my code above, is the refactoring / information hiding warranted? Does my refactored operation look too opaque?
I would say that your refactor is extremely good, if your new single line of code replaces a handful of lines of code in many places in your program. Especially since the functionality is going to be the same in all of those places.
The programmer coming after you and looking at your code will simply look at the definition of the extension method to find out what it does, and now he knows that this code is defined in one place, so there is no possibility of it differing from place to place.
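A minimal sketch of what such a single, shared definition might look like, written in Python rather than C#. The exception and function names are hypothetical and simply mirror the ones in the question; the helper takes the count that the SELECT COUNT(*) query returned.

```python
class RecordNotFoundError(Exception):
    """Raised when the row to delete does not exist."""


class DatabaseStructureChangedError(Exception):
    """Raised when a primary-key lookup unexpectedly matches several rows."""


def verify_deletion(count: int) -> None:
    """Check that exactly one row matched before the delete proceeds."""
    if count == 0:
        raise RecordNotFoundError("no row matched the primary key")
    if count > 1:
        raise DatabaseStructureChangedError(f"{count} rows matched the primary key")


verify_deletion(1)  # exactly one match: returns quietly
```

Because every delete routine calls the same helper, a change to the verification rule happens in one place.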
Try it if you must, but my feeling is it's not about succinctness but whether or not you want to enforce the behavior every time or most of the time. And by extension, if the verify-condition changes that it would likely change across the board.
Basically, reducing a small chunk of boiler-plate code doesn't necessarily make things more succinct; it's just one more bit of abstractness the developer has to wade through and understand.
As a developer, I'd have no idea what "ExecuteWithVerify" means. What exactly are we verifying? I'd have to look it up and remember it. But with the boiler-plate code, I can look at the code and understand exactly what's going on.
And by NOT reducing it to a separate method I can also tune the boiler-plate code for cases where exceptions need to be thrown for differing conditions.
It's not information-hiding when you extract or refactor your code. It's only information-hiding when you start restricting access to your extension definition after refactoring.
"new" operator within a Class (except for the Constructor) should be Avoided at all costs. This is what you need to refactor here.

Flatten conditional as a refactoring

Consider:
if (something) {
// Code...
}
With CodeRush installed it recommended doing:
if (!something) {
return;
}
// Code...
Could someone explain how this is better? Surely there is no benefit whatsoever.
Isolated, as you've presented it - no benefit. But mark4o is right on: it's less nesting, which becomes very clear if you look at even, say a 4-level nesting:
public void foo() {
if (a)
if (b)
if (c)
if (d)
doSomething();
}
versus
public void foo() {
if (!a)
return;
if (!b)
return;
if (!c)
return;
if (!d)
return;
doSomething();
}
Early returns like this improve readability.
In some cases, it's cleaner to validate all of your inputs at the beginning of a method and just bail out if anything is not correct. You can have a series of single-level if checks that check successively more and more specific things until you're confident that your inputs are good. The rest of the method will then be much easier to write, and will tend to have fewer nested conditionals.
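Here is a small Python sketch of that validate-first style. The function, its parameters and the validation rules are all invented for the example; the point is only the shape, a series of single-level checks followed by the main body.

```python
def register(username: str, password: str, email: str) -> str:
    """Validate inputs with single-level checks, then do the real work."""
    if not username:
        return "missing username"
    if len(password) < 8:
        return "password too short"
    if "@" not in email:
        return "invalid email"
    # All inputs are known to be good from here on, so no nesting is needed.
    return f"registered {username}"


print(register("alice", "correct horse", "alice@example.com"))
print(register("alice", "short", "alice@example.com"))
```

Each check goes from the most general to the more specific, and the body after the last check never has to re-test its inputs.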
One less level of nesting.
This is a conventional refactoring meant for maintainability. See:
http://www.refactoring.com/catalog/replaceNestedConditionalWithGuardClauses.html
With one condition, it's not a big improvement. But it follows the "fail fast" principle, and you really start to notice the benefit when you have lots of conditions. If you grew up on "structured programming", which typically recommends functions have single exit points, it may seem unnatural, but if you've ever tried to debug code that has three levels or more of nested conditionals, you'll start to appreciate it.
It can be used to make the code more readable (by way of less nesting). See here for a good example, and here for a good discussion of the merits.
That sort of pattern is commonly used to replace:
void SomeMethod()
{
if (condition_1)
{
if (condition_2)
{
if (condition_3)
{
// code
}
}
}
}
With:
void SomeMethod()
{
if (!condition_1) { return; }
if (!condition_2) { return; }
if (!condition_3) { return; }
// code
}
Which is much easier on the eyes.
I don't think CodeRush is recommending it --- rather just offering it as an option.
IMO, it depends on if something or !something is the exceptional case. If there is a significant amount of code if something happens, then using the !something conditional makes more sense for legibility and potential nesting reduction.
Well, look at it this way (I'll use php as an example):
You fill a form and go to this page: validate.php
example 1:
<?php
if (valid_data($_POST['username'])) {
if (valid_data($_POST['password'])) {
login();
} else {
die();
}
} else {
die();
}
?>
vs
<?php
if (!valid_data($_POST['username'])) {
die();
}
if (!valid_data($_POST['password'])) {
die();
}
login();
?>
Which one is better and easier to maintain? Remember this is just validating two things. Imagine this for a register page or something else.
I remember very clearly losing marks on a piece of college work because I had gone with the
if (!something) {
return;
}
// Code...
format. My lecturer pontificated that it was bad practice to have more than one exit point in a function. I thought that was nuts and 20+ years of computer programming later, I still do.
To be fair, he lived in an era where the lingua franca was C and functions were often pages long and full of nested conditionals making it difficult to track what was going on.
Then and now, however, simplicity is king: Keeping functions small and commenting them well is the best way to make things readable and maintainable.

Which syntax is better for return value?

I've been doing a massive code review and one pattern I notice all over the place is this:
public bool MethodName()
{
bool returnValue = false;
if (expression)
{
// do something
returnValue = MethodCall();
}
else
{
// do something else
returnValue = Expression;
}
return returnValue;
}
This is not how I would have done it; I would have just returned the value as soon as I knew what it was. Which of these two patterns is more correct?
I stress that the logic always seems to be structured such that the return value is assigned in one place only and no code is executed after it's assigned.
A lot of people recommend having only one exit point from your methods. The pattern you describe above follows that recommendation.
The main gist of that recommendation is that if you have to clean up some memory or state before returning from the method, it's better to have that code in one place only. Having multiple exit points leads to either duplication of cleanup code or potential problems due to missing cleanup code at one or more of the exit points.
Of course, if your method is a couple of lines long, or doesn't need any cleanup, you could have multiple returns.
I would have used ternary, to reduce control structures...
return expression ? MethodCall() : Expression;
I suspect I will be in the minority but I like the style presented in the example. It is easy to add a log statement and set a breakpoint, IMO. Plus, when used in a consistent way, it seems easier to "pattern match" than having multiple returns.
I'm not sure there is a "correct" answer on this, however.
Some learning institutes and books advocate the single return practice.
Whether it's better or not is subjective.
That looks like a part of a bad OOP design. Perhaps it should be refactored on the higher level than inside of a single method.
Otherwise, I prefer using a ternary operator, like this:
return expression ? MethodCall() : Expression;
It is shorter and more readable.
Return from a method right away in any of these situations:
You've found a boundary condition and need to return a unique or sentinel value: if (node.next == null) return NO_VALUE_FOUND;
A required value/state is false, so the rest of the method does not apply (aka a guard clause). E.g.: if (listeners == null) return null;
The method's purpose is to find and return a specific value, e.g.: if (nodes[i].value == searchValue) return i;
You're in a clause which returns a unique value from the method not used elsewhere in the method: if (userNameFromDb.equals(SUPER_USER)) return getSuperUserAccount();
Otherwise, it is useful to have only one return statement so that it's easier to add debug logging, resource cleanup and follow the logic. I try to handle all the above 4 cases first, if they apply, then declare a variable named result(s) as late as possible and assign values to that as needed.
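The combined style described above might look like the following in Python. This is only a sketch with invented names: the early-return cases are handled first, and everything else flows into one result variable with a single place for logging.

```python
def describe_account(user_name, accounts):
    """Handle the early-return cases first, then build one result."""
    if accounts is None:      # guard clause: nothing to search
        return None
    if user_name == "root":   # unique value not used elsewhere in the function
        return "super user"

    result = "unknown"        # declared as late as possible
    for name, role in accounts.items():
        if name == user_name:
            result = role
            break
    # Single exit for the main path, so logging lives in one place.
    print(f"describe_account({user_name!r}) -> {result!r}")
    return result


describe_account("bob", {"alice": "admin", "bob": "editor"})
```

Only the main path goes through the logging line; the guard cases leave before any work is done.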
They both accomplish the same task. Some say that a method should only have one entry and one exit point.
I use this, too. The idea is that resources can be freed in the normal flow of the program. If you jump out of a method at 20 different places, and you need to call cleanUp() before, you'll have to add yet another cleanup method 20 times (or refactor everything)
I guess that the coder has taken the design of defining an object toReturn at the top of the method (e.g., List<Foo> toReturn = new ArrayList<Foo>();) and then populating it during the method call, and somehow decided to apply it to a boolean return type, which is odd.
Could also be a side effect of a coding standard that states that you can't return in the middle of a method body, only at the end.
Even if no code is executed after the return value is assigned now it does not mean that some code will not have to be added later.
It's not the smallest piece of code which could be used but it is very refactoring-friendly.
Delphi forces this pattern by automatically creating a variable called "Result" which will be of the function's return type. Whatever "Result" is when the function exits, is your return value. So there's no "return" keyword at all.
function MethodName : boolean;
begin
Result := False;
if Expression then begin
//do something
Result := MethodCall;
end
else begin
//do something else
Result := Expression;
end;
//possibly more code
end;
The pattern used is verbose - but it's also easier to debug if you want to know the return value without opening the Registers window and checking EAX.
