Is there a case where parameter validation may be considered redundant?

The first thing I do in a public method is to validate every single parameter before they get any chance to get used, passed around or referenced, and then throw an exception if any of them violate the contract. I've found this to be a very good practice as it lets you catch the offender the moment the infraction is committed but then, quite often I write a very simple getter/indexer such as this:
private List<Item> m_items = ...;
public Item GetItemByIdx( int idx )
{
if( (idx < 0) || (idx >= m_items.Count) )
{
throw new ArgumentOutOfRangeException( "idx", "Invalid index" );
}
return m_items[ idx ];
}
In this case the index parameter maps directly to the indexes in the list, and I know for a fact (e.g. from the documentation) that the list itself will do exactly the same check and will throw the same exception. Should I remove this check, or had I better leave it alone?
I wanted to know what you guys think, as I'm now in the middle of refactoring a big project and I've found many cases like the above.
Thanks in advance.

It's not just a matter of taste. Consider:
if (!File.Exists(fileName)) throw new ArgumentException("...");
var s = File.OpenText(fileName);
This looks similar to your example but there are several reasons (concurrency, access rights) why the OpenText() method could still fail, even with a FileNotFound error. So the Exists-check is just giving a false feeling of security and control.
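The same point can be sketched in Java (the example above is C#). This is only an illustration under stated assumptions: the class and method names are hypothetical, and returning null for a missing file is an arbitrary choice. The code skips the exists-check, attempts the open, and handles the failure where it actually happens:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Paths;

class FirstLineReader {
    // Returns the first line of the file, or null if the file is missing
    // or empty. There is no separate "exists" check: the open itself is
    // the check, so there is no window between checking and opening.
    static String readFirstLine(String fileName) {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName))) {
            return reader.readLine();
        } catch (NoSuchFileException e) {
            return null; // hypothetical policy: a missing file means "no data"
        } catch (IOException e) {
            // Access rights, I/O errors, etc. still surface to the caller.
            throw new UncheckedIOException(e);
        }
    }
}
```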
It is a mind-set thing, when you are writing the GetItemByIdx method it probably looks quite sensible. But if you look around in a random piece of code there are usually lots of assumptions you could check before proceeding. It's just not practical to check them all, over and over. We have to be selective.
So in a simple pass-along method like GetItemByIdx I would argue against redundant checks. But as soon as the function adds more functionality or if there is a very explicit specification that says something about idx that argument turns around.
As a rule of thumb an exception should be thrown when a well defined condition is broken and that condition is relevant at the current level. If the condition belongs to a lower level, then let that level handle it.
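A minimal Java sketch of that rule of thumb (the question's code is C#; ItemStore and both method names are hypothetical). The pass-along getter leaves the range check to the list, while the method that defines its own contract, here a 1-based position, validates at its own level:

```java
import java.util.ArrayList;
import java.util.List;

class ItemStore {
    private final List<String> items;

    ItemStore(List<String> items) {
        this.items = new ArrayList<>(items);
    }

    // Pass-along getter: ArrayList.get already throws
    // IndexOutOfBoundsException for a bad index, so no check is added.
    String getItemByIdx(int idx) {
        return items.get(idx);
    }

    // This method has its own contract (positions start at 1), so the
    // condition is relevant at this level and is checked here.
    String getItemByPosition(int position) {
        if (position < 1 || position > items.size()) {
            throw new IllegalArgumentException(
                "position must be between 1 and " + items.size() + ", got " + position);
        }
        return items.get(position - 1);
    }
}
```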

I would only do parameter verification where it would lead to some improvement in code behavior. Since you know, in this case, that the check will be performed by the List itself, then your own check is redundant and provides no extra value, so I wouldn't bother.

It's true that you have possibly duplicated work that's already done in the API, but it's there now. If your error handling framework works and is solid, and isn't causing performance issues (profiling is your friend), then I reckon leave it, and gradually phase it out if you have time. It doesn't sound like a top priority!


What is the top type in the Hack language?

In the Hack language type system, is there a "top" type, also known as an "any" type, or a universal "Object" type? That is, a type which all types are subclasses of?
The manual mentions "mixed" types, which might be similar, but are not really explained. There is also the possibility of simply omitting the type declaration in some places. However, this cannot be done everywhere, e.g. if I want to declare something to be a function from string to the top type, it's not clear how I do this. function (string): mixed?
I'm an engineer working on Hack at Facebook. This is a really insightful and interesting question. Depending on what exactly you're getting at, Hack has a couple different variations of this.
First, let's talk about mixed. It's the supertype of everything. For example, this typechecks:
<?hh // strict
function f(): mixed {
return 42;
}
But since it's the supertype of everything, you can't do much with a mixed value until you case analyze on what it actually is, via is_int, instanceof, etc. Here's an example of how you'd have to use the result of f():
<?hh // strict
function g(): int {
$x = f();
if (is_int($x)) {
return $x;
} else {
return 0;
}
}
The "missing annotation" type ("any") is somewhat different than this. Whereas mixed is the supertype of everything, "any" unifies with everything -- it's both the supertype and subtype of everything. This means that if you leave off an annotation, we'll assume you know what you're doing and just let it pass. For example, the following code typechecks as written:
<?hh
// No "strict" since we are omitting annotations
function f2() {
return 42;
}
function g2(): string {
return f2();
}
This clearly isn't sound -- we just broke the type system and will cause a runtime type error if we execute the above code -- but it's admitted in partial mode in order to ease conversion. Strict requires that you annotate everything, and so you can't get a value of type "any" in order to break the type system in this way if all of your code is in strict. Consider how you'd have to annotate the code above in strict mode: either f2 would have to return int and that would be a straight-up type error ("string is not compatible with int"), or f2 would have to return mixed and that would be a type error as written ("string is not compatible with mixed") until you did a case analysis with is_int etc as I did in my earlier example.
Hope this clears things up -- if you want clarification let me know in the comments and I'll edit. And if you have other questions that aren't strict clarifications of this, continue tagging them "hacklang" and we'll make sure they get responded to!
Finally: if you wouldn't mind, could you press the "file a documentation bug" on the docs pages that were confusing or unclear, or could in any way be improved? We ideally want docs.hhvm.com to be a one-stop place for stuff like this, but there are definitely holes in the docs that we're hoping smart, enthusiastic folks like yourself will help point out. (i.e., I thought this stuff was explained well in the docs, but since you are confused that is clearly not the case, and we'd really appreciate a bug report detailing where you got lost.)

Clone detection algorithm

I'm writing an algorithm that detects clones in source code. E.g. if there is a block like:
for(int i = o; i <5; i++){
doSomething(abc);
}
...and if this block is repeated somewhere else in the source code it will be detected as a clone. The method I am using at the moment is to create hashes for lines/blocks and compare them with hashes of other lines/blocks in the same source to see if there are any matches.
Now, if the same block as above was to be repeated somewhere with only the argument of doSomething different, it would not be detected as a clone even though it would appear very much like a clone to you and me. My algorithm detects exact matches but doesn't detect matching blocks where only the argument is different.
Could anyone suggest any ways of getting around this issue? Thanks!
Here's a super-simple way, which might go too far in erasing information (i.e., might produce too many false positives): replace every identifier that isn't a keyword with some fixed name. So you'd get
for (int DUMMY = DUMMY; DUMMY<5; DUMMY++) {
DUMMY(DUMMY);
}
(assuming you really meant o rather than 0 in the initialization part of the for-loop).
If you get a huge number of false positives with this, you could then post-process them by, for instance, looking to see what fraction of the DUMMYs actually correspond to the same identifier in both halves of the match, or at least to identifiers that are consistent between the two.
To do much better you'll probably need to parse the code to some extent. That would be a lot more work.
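The identifier-replacement idea above can be sketched in Java roughly as follows. This is a hedged illustration, not a real tokenizer: the class name, the tiny keyword list and the regex are all assumptions, and string literals and comments are not handled. It shows that two blocks differing only in names end up with the same fingerprint:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CloneNormalizer {
    // A deliberately small keyword list, for illustration only.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "for", "while", "if", "else", "return", "int", "void"));

    private static final Pattern IDENTIFIER =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Replaces every non-keyword identifier with DUMMY and drops all
    // whitespace, so blocks that differ only in names become equal.
    static String normalize(String block) {
        Matcher m = IDENTIFIER.matcher(block);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            m.appendReplacement(out, KEYWORDS.contains(word) ? word : "DUMMY");
        }
        m.appendTail(out);
        return out.toString().replaceAll("\\s+", "");
    }

    // Hash of the normalized text; compare these instead of raw-line hashes.
    static int fingerprint(String block) {
        return normalize(block).hashCode();
    }
}
```

Matching fingerprints are only candidates; the post-processing step described above would still be needed to weed out false positives.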
Well, if you're going to do something else then you're going to have to parse the code at least a bit. For example, you could detect methods and then ignore the method arguments in your hash. Anyway, I think it's always true that you need your program to understand the code better than 'just text blocks', and that might get awfully complicated.

Using function arguments as local variables

Something like this (yes, this doesn't deal with some edge cases - that's not the point):
int CountDigits(int num) {
int count = 1;
while (num >= 10) {
count++;
num /= 10;
}
return count;
}
What's your opinion about this? That is, using function arguments as local variables.
Both are placed on the stack, and pretty much identical performance wise, I'm wondering about the best-practices aspects of this.
I feel like an idiot when I add an additional and quite redundant line to that function consisting of int numCopy = num, however it does bug me.
What do you think? Should this be avoided?
As a general rule, I wouldn't use a function parameter as a local processing variable, i.e. I treat function parameters as read-only.
In my mind, intuitively understandable code is paramount for maintainability, and modifying a function parameter to use as a local processing variable tends to run counter to that goal. I have come to expect that a parameter will have the same value in the middle and bottom of a method as it does at the top. Plus, an aptly-named local processing variable may improve understandability.
Still, as @Stewart says, this rule is more or less important depending on the length and complexity of the function. For short simple functions like the one you show, simply using the parameter itself may be easier to understand than introducing a new local variable (very subjective).
Nevertheless, if I were to write something as simple as countDigits(), I'd tend to use a remainingBalance local processing variable in lieu of modifying the num parameter as part of local processing - just seems clearer to me.
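For comparison, here is a sketch of the question's function with a separate scratch variable, written in Java. The class name and the variable name remaining are my own choices for illustration, and negative input is still not handled, as in the question:

```java
class Digits {
    // Counts the decimal digits of a non-negative number.
    static int countDigits(final int num) {
        int remaining = num; // scratch copy; num keeps its original value
        int count = 1;
        while (remaining >= 10) {
            count++;
            remaining /= 10;
        }
        return count;
    }
}
```

Declaring the parameter final makes the compiler reject any accidental assignment to it later in the method.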
Sometimes, I will modify a local parameter at the beginning of a method to normalize the parameter:
void saveName(String name) {
name = (name != null ? name.trim() : "");
...
}
I rationalize that this is okay because:
a. it is easy to see at the top of the method,
b. the parameter maintains its original conceptual intent, and
c. the parameter is stable for the rest of the method
Then again, half the time, I'm just as apt to use a local variable anyway, just to get a couple of extra finals in there (okay, that's a bad reason, but I like final):
void saveName(final String name) {
final String normalizedName = (name != null ? name.trim() : "");
...
}
If, 99% of the time, the code leaves function parameters unmodified (i.e. mutating parameters are unintuitive or unexpected for this code base), then, during that other 1% of the time, dropping a quick comment about a mutating parameter at the top of a long/complex function could be a big boon to understandability:
int CountDigits(int num) {
// num is consumed
int count = 1;
while (num >= 10) {
count++;
num /= 10;
}
return count;
}
P.S. :-)
parameters vs arguments
http://en.wikipedia.org/wiki/Parameter_(computer_science)#Parameters_and_arguments
These two terms are sometimes loosely used interchangeably; in particular, "argument" is sometimes used in place of "parameter". Nevertheless, there is a difference. Properly, parameters appear in procedure definitions; arguments appear in procedure calls.
So,
int foo(int bar)
bar is a parameter.
int x = 5;
int y = foo(x);
The value of x is the argument for the bar parameter.
It always feels a little funny to me when I do this, but that's not really a good reason to avoid it.
One reason you might potentially want to avoid it is for debugging purposes. Being able to tell the difference between "scratchpad" variables and the input to the function can be very useful when you're halfway through debugging.
I can't say it's something that comes up very often in my experience - and often you can find that it's worth introducing another variable just for the sake of having a different name, but if the code which is otherwise cleanest ends up changing the value of the variable, then so be it.
One situation where this can come up and be entirely reasonable is where you've got some value meaning "use the default" (typically a null reference in a language like Java or C#). In that case I think it's entirely reasonable to modify the value of the parameter to the "real" default value. This is particularly useful in C# 4 where you can have optional parameters, but the default value has to be a constant:
For example:
public static void WriteText(string file, string text, Encoding encoding = null)
{
// Null means "use the default" which we would document to be UTF-8
encoding = encoding ?? Encoding.UTF8;
// Rest of code here
}
About C and C++:
My opinion is that using the parameter as a local variable of the function is fine because it is a local variable already. Why then not use it as such?
I feel silly too when copying the parameter into a new local variable just to have a modifiable variable to work with.
But I think this is pretty much a personal opinion. Do it as you like. If you feel silly copying the parameter just because of this, it indicates your personality doesn't like it and then you shouldn't do it.
If I don't need a copy of the original value, I don't declare a new variable.
IMO I don't think mutating the parameter values is a bad practice in general,
it depends on how you're going to use it in your code.
My team coding standard recommends against this because it can get out of hand. To my mind for a function like the one you show, it doesn't hurt because everyone can see what is going on. The problem is that with time functions get longer, and they get bug fixes in them. As soon as a function is more than one screen full of code, this starts to get confusing which is why our coding standard bans it.
The compiler ought to be able to get rid of the redundant variable quite easily, so it has no efficiency impact. It is probably just between you and your code reviewer whether this is OK or not.
I would generally not change the parameter value within the function. If at some point later in the function you need to refer to the original value, you still have it. In your simple case there is no problem, but if you add more code later, you may refer to 'num' without realizing it has been changed.
The code needs to be as self sufficient as possible. What I mean by that is you now have a dependency on what is being passed in as part of your algorithm. If another member of your team decides to change this to a pass by reference then you might have big problems.
The best practice is definitely to copy the inbound parameters if you expect them to be immutable.
I typically don't modify function parameters, unless they're pointers, in which case I might alter the value that's pointed to.
I think best practice here varies by language. For example, in Perl you can localize a package variable or even part of a data structure to a local scope, so that changing it in that scope will not have any effect outside of it:
sub my_function
{
my ($arg1, $arg2) = @_; # get the local variables off the stack
$arg1++; # $arg1 is already a private copy (my), so this is not visible outside; "local" cannot be applied to a lexical variable
local $arg2->{key1}; # only the key1 portion of the hashref referenced by $arg2 is localized
$arg2->{key1}->{key2} = 'foo'; # this change is not visible outside the function
}
Occasionally I have been bitten by forgetting to localize a data structure that was passed by reference to a function, that I changed inside the function. Conversely, I have also returned a data structure as a function result that was shared among multiple systems and the caller then proceeded to change the data by mistake, affecting these other systems in a difficult-to-trace problem usually called action at a distance. The best thing to do here would be to make a clone of the data before returning it*, or make it read-only**.
* In Perl, see the function dclone() in the built-in Storable module.
** In Perl, see lock_hash() or lock_hash_ref() in the built-in Hash::Util module.
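The same two remedies, cloning before returning or making the data read-only, look roughly like this in Java. This is a sketch with hypothetical names (SharedSettings, hostsCopy, hostsReadOnly); the copy is shallow, which is enough here only because the elements are immutable strings:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SharedSettings {
    private final List<String> hosts = new ArrayList<>();

    void addHost(String host) {
        hosts.add(host);
    }

    // Option 1: hand out a copy; the caller may change it freely
    // without affecting the shared list.
    List<String> hostsCopy() {
        return new ArrayList<>(hosts);
    }

    // Option 2: hand out a read-only view; attempts to modify it
    // throw UnsupportedOperationException.
    List<String> hostsReadOnly() {
        return Collections.unmodifiableList(hosts);
    }
}
```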

Help me refactor this loop

I am working on the redesign of an existing class. In this class there is a roughly 400-line while loop that does most of the work. The body of the loop is a minefield of if statements and variable assignments, and there is a "continue" in the middle somewhere. The purpose of the loop is hard to understand.
In pseudocode, here's where I'm at with the redesign:
/* Some code here to create the objects based on config parameters */
/* Rather than having if statements scattered through the loop I */
/* create instances of the appropriate classes. The constructors */
/* take a database connection. */
FOR EACH row IN mySourceOfData
int p = batcher.FindOrCreateBatch( row );
int s = supplierBatchEntryCreator.CreateOrUpdate( row, p );
int b = buyerBatchEntryCreator.CreateOrUpdate( row, p );
mySourceOfData.UpdateAsIncludedInBatch( p, s, b);
NEXT
/* Allow things to complete their last item */
mySupplierBatchEntry.finish();
myBuyerBatchEntry.finish();
myBatcher.finish();
/* Some code here to dispose of things */
RETURN myBatch.listOfBatches();
Inside FindOrCreateBatch() it figures out using some rules if a new batch needs to be created or if an existing one can be used. The different implementations of this interface will have different rules for how it finds them, etc. The return value is the surrogate key (id) from the database of the payment batch that it found or created. This id is required by following processes that take p as a parameter.
This is an improvement over where I started, but I have an uneasy feeling about the class containing this loop.
It doesn't seem to be a domain object; it's more of a "Manager" or "Controller" type class.
It seems to be getting in between batcher and supplierBatchEntryCreator (and the other classes). At the moment only an int is passed, but if that changes all three classes need to change. This seems like a Law of Demeter violation.
Any suggestions, or is this ok? The actual language is Java.
I have a couple of questions to ask you:
Does it work?
Is it fast enough?
Is it readable/maintainable?
If the answer to all three is yes then, beyond that, further changes are really just wasted effort in my opinion. Don't refactor just for the sake of refactoring.
Far too often people change things in anticipation of what might be (your "changing int" for example). I prefer to subscribe to the YAGNI school of thought. The right time to worry about that is when you do it.
And the Law of Demeter is a design guideline, not a rule. In the real world, pragmatism usually beats dogmatism :-)
What is the relationship between each XXXEntryCreator and XXXEntry? I feel like I am missing something, since the "Creators" only return integers.
Beyond that, you took 400 lines of crud down to something that fits on a screen, and has a reasonably visible data flow between steps. Kudos. (I have experienced strong resistance in the past for trying to make such changes -- why do people write N-100/1000 line run-on else-if drivel?)
FindOrCreate and CreateOrUpdate suggest to me that maybe multiple passes would be simpler (and not knowing the rest of the code, I can't know if it would degrade performance, which is a common concern raised when multiple passes are suggested).
If you had one loop to create any missing batches, suppliers, and buyers (or three loops), then this loop could be reduced to
FOR EACH row IN mySourceOfData
int p = batcher.FindBatch( row );
int s = supplierBatchEntryCreator.Update( row, p );
int b = buyerBatchEntryCreator.Update( row, p );
mySourceOfData.UpdateAsIncludedInBatch( p, s, b);
NEXT
Now I see that the Creators are updating - is that right? Does splitting the creation and update responsibility into two classes make sense, perhaps?
It's starting to look a little simpler to me. Does it help?

Exceptions for flow of control

There is an interesting post over here about this, in relation to cross-application flow of control.
Well, recently, I've come across an interesting problem: generating the nth value in a potentially (practically) endless recursive sequence. This particular algorithm WILL be at least 10-15 stack frames deep at the point that it succeeds. My first thought was to throw a SuccessException that looked something like this (C#):
class SuccessException : Exception
{
public string Value
{ get; set; }
public SuccessException(string value)
: base()
{
Value = value;
}
}
Then do something like this:
try
{
Walk_r(tree);
}
catch (SuccessException ex)
{
result = ex.Value;
}
Then my thoughts wandered back here, where I've heard over and over to never use Exceptions for flow control. Is there ever an excuse? And how would you structure something like this, if you were to implement it?
In this case I would look at your Walk_r method: it should return a value. Throwing an exception to indicate success is NOT common practice and, at minimum, is going to be VERY confusing to anyone who sees the code. Not to mention the overhead associated with exceptions.
Walk_r should simply return the value when it is hit. It is a pretty standard recursion example. The only potential problem I see is that you said it is potentially endless, which will have to be compensated for in the Walk_r code by keeping count of the recursion depth and stopping at some maximum value.
The exception actually makes the coding very strange since the method call now throws an exception to return the value, instead of simply returning 'normally'.
try
{
Walk_r(tree);
}
catch (SuccessException ex)
{
result = ex.Value;
}
becomes
result = Walk_r(tree);
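A sketch of what such a returning, depth-limited walk might look like, written in Java rather than the question's C#. The Node type, the target parameter and the maxDepth limit are all assumptions made for illustration, since the original Walk_r is not shown:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class Node {
    final String value;
    final List<Node> children = new ArrayList<>();

    Node(String value) {
        this.value = value;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }
}

class TreeSearch {
    // Depth-first search that returns the first value equal to target and
    // gives up after descending maxDepth levels. An empty Optional means
    // "not found"; no exception is involved in the success path.
    static Optional<String> walk(Node node, String target, int maxDepth) {
        if (node.value.equals(target)) {
            return Optional.of(node.value);
        }
        if (maxDepth == 0) {
            return Optional.empty();
        }
        for (Node child : node.children) {
            Optional<String> found = walk(child, target, maxDepth - 1);
            if (found.isPresent()) {
                return found; // propagate success straight up the stack
            }
        }
        return Optional.empty();
    }
}
```

Each frame only has to check one return value, so unwinding 10-15 levels on success costs a handful of comparisons.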
I'm going to play devil's advocate here and say stick with the exception to indicate success. It might be expensive to throw/catch but that may be insignificant compared with the cost of the search itself and possibly less confusing than an early exit from the method.
It's not a very good idea to throw exceptions as part of an algorithm, especially in .NET. In some languages/platforms exceptions are pretty efficient when thrown, and they are routinely thrown in normal operation, for instance when an iterable gets exhausted.
Why not just return the resulting value? If it returns anything at all, assume it is successful. If it fails to return a value, then it means the loop failed.
If you must bring back from a failure, then I'd recommend you throw an exception.
The issue with using exceptions is that they (in the grand scheme of things) are very inefficient and slow. It would surely be as easy to have an if condition within the recursive function to just return as and when needed. To be honest, with the amount of memory on modern PCs it's unlikely (not impossible though) that you'll get a stack overflow with only a small number of recursive calls (<100).
If the stack is a real issue, then it might become necessary to be 'creative' and implement a 'depth limited search strategy', allow the function to return from the recursion and restart the search from the last (deepest) node.
To sum up: exceptions should only be used in exceptional circumstances, and I don't believe the success of a function call qualifies as such.
Using exceptions in normal program flow in my book is one of the worst practises ever.
Consider the poor sap who is hunting for swallowed exceptions and is running a debugger set to stop whenever an exception happens. That dude is now getting mad.... and he has an axe. :P
