Which is the more important code quality metric - between cyclomatic complexity or average fan out? - static-analysis

I was working on TICS code quality metrics for the first time, and I have this question.
Many suggest breaking large functions into one or more functions in order to keep complexity less than 15. Doing so would increase number of functions called by the given function, hence average fan out would be increased.
Should we make the decision to decrease fan out or decrease cyclomatic complexity? Decreasing Cyclomatic complexity would increase maintainability, but splitting functions into 2 or more functions would increase the number of function calls, which would cost most memory.
So which of these two metrics is more important to improve?

I doubt you'll find a definitive answer. In general, I'd advise worrying more about cyclomatic complexity than fan out. You can reduce the cyclomatic complexity of a method using the extract method refactoring. Assuming that the new method that you extracted has a specific purpose that is easy-to-name, you will have replaced detailed logic (the various conditions) with a summary of those conditions (the extracted method). This should make the code easier to read and maintain.

All software quality metrics (including cyclomatic complexity and fanout) are at best unreliable measures of quality. I would strongly caution against trying to improve any metric just for the sake of it.
Metrics can however be reliably used to draw your attention to code that may need attention. If a class or method scores poorly for some metric, it is worth looking at it and deciding if there really is an issue that should be fixed, but it's developer judgment that should make final call on what action should be taken.

Related

Why is algorithm time complexity often defined in terms of steps/operations?

I've been doing a lot of studying from many different resources on algorithm analysis lately, and one thing I'm currently confused about is why time complexity is often defined in terms of the number of steps/operations an algorithm performs.
For instance, in Introduction to Algorithms, 3rd Edition by Cormen, he states:
The running time of an algorithm on a particular input is the number of primitive operations or “steps” executed. It is convenient to define the notion of step so that it is as machine-independent as possible.
I've seen other resources define the time complexity as such as well. I have a problem with this because, for one, it's called TIME complexity, not "step complexity" or "operations complexity." Secondly, while it's not a definitive source, an answer to a post here on Stackoverflow states "Running time is how long it takes a program to run. Time complexity is a description of the asymptotic behavior of running time as input size tends to infinity." Further, on the Wikipedia page for time complexity it states "In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm." Again, these are definitive sources, things makes logical sense using these definitions.
When analyzing an algorithm and deriving its time complexity function, such as in Figure 1 below, you get an equation that is in units of time. It CAN represent the amount of operations the algorithm performs, but only if those constant factors (C_1, C_2, C_3, etc.) are each a value of 1.
Figure 1
So with all that said, I'm just wondering how it's possible for this to be defined as the number of steps when that's not really what it represents. I'm trying to clear things up and make the connection between time and number of operations. I feel like there is a lot of information that hasn't been explicitly stated in the resources I've studied. Hoping someone can help clear things up for me, and without going into discussion about Big-O because that shouldn't be needed and misses the point of the question, in my opinion.
Thank you everyone for your time and help.
why time complexity is often defined in terms of the number of steps/operations an algorithm performs?
TL;DR: because that is how the asymptotic analysis work; also, do not forget, that time is a relative thing.
Longer story:
Measuring the performance in time, as we, humans understand the time in a daily use, doesn't make much sense, as it is not always that trivial task to do.. furthermore - it even makes no sense in a broader perspective.
How would you measure what is the space and time your algorithm takes? what will be the conditional and predefined unit of the measurement you're going to apply to see the running time/space complexity of your algorithm?
You can measure it on your clock, or use some libraries/API to see exactly how many seconds/minutes/megabytes your algorithm took.. or etc.
However, this all will be VERY much variable! because, the time/space your algorithm took, will depend on:
Particular hardware you're using (architecture, CPU, RAM, etc.);
Particular programming language;
Operating System;
Compiler, you used to compile your high-level code into lower abstraction;
Other environment-specific details (sometimes, even on the temperature.. as CPUs might be scaling operating frequency dynamically)..
therefore, it is not the good thing to measure your complexity in the precise timing (again, as we understand the timing on this planet).
So, if you want to know the complexity (let's say time complexity) of your algorithm, why would it make sense to have a different time for different machines, OSes, and etc.? Algorithm Complexity Analysis is not about measuring runtime on a particular machine, but about having a clear and mathematically defined precise boundaries for the best, average and worst cases.
I hope this makes sense.
Fine, we finally get to the point, that algorithm analysis should be done as a standalone, mathematical complexity analysis.. which would not care what is the machine, OS, system architecture, or anything else (apart from algorithm itself), as we need to observe the logical abstraction, without caring about whether you're running it on Windows 10, Intel Core2Duo, or Arch Linux, Intel i7, or your mobile phone.
What's left?
Best (so far) way for the algorithm analysis, is to do the Asymptotic Analysis, which is an abstract analysis calculated on the basis of input.. and that is counting almost all the steps and operations performed in the algorithm, proportionally to your input.
This way you can speak about the Algorithm, per se, instead of being dependent on the surrounding circumstances.
Moreover; not only we shouldn't care about machine or peripheral factors, we also shouldn't care about Lower Order Terms and Constant Factors in the mathematical expression of the Asymptotic Analysis.
Constant Factors:
Constant Factors are instructions which are independent from the Input data. i.e. which are NOT dependent on the input argument data.
Few reasons why you should ignore them are:
Different programming language syntaxes, as well as their compiled files, will have different number of constant operations/factors;
Different Hardware will give different run-time for the same constant factors.
So, you should eliminate thinking about analyzing constant factors and overrule/ignore them. Only focus on only input-related important factors; therefore:
O(2n) == O(5n) and all these are O(n);
6n2 == 10n2 and all these are n2.
One more reason why we won't care about constant factors is that they we usually want to measure the complexity for sufficiently large inputs.. and when the input grows to the + infinity, it really makes no sense whether you have n or 2n.
Lower order terms:
Similar concept applies in this point:
Lower order terms, by definition, become increasingly irrelevant as you focus on large inputs.
When you have 5x4+24x2+5, you will never care much on exponent that is less than 4.
Time complexity is not about measuring how long an algorithm takes in terms of seconds. It's about comparing different algorithms, how they will perform with a certain amount if input data. And how this performance develops when the input data gets bigger.
In this context, the "number of steps" is an abstract concept for time, that can be compared independently from any hardware. Ie you can't tell how long it will take to execute 1000 steps, without exact specifications of your hardware (and how long one step will take). But you can always tell, that executing 2000 steps will take about twice as long as executing 1000 steps.
And you can't really discuss time complexity without going into Big-O, because that's what it is.
You should note that Algorithms are more abstract than programs. You check two algorithms on a paper or book and you want to analyze which works faster for an input data of size N. So you must analyze them with logic and statements. You can also run them on a computer and measure the time, but that's not proof.
Moreover, different computers execute programs at different speeds. It depends on CPU speed, RAM, and many other conditions. Even a program on a single computer may be run at different speeds depending on available resources at a time.
So, time for algorithms must be independent of how long a single atomic instruction takes to be executed on a specific computer. It's considered just one step or O(1). Also, we aren't interested in constants. For example, it doesn't matter if a program has two or 10 instructions. Both will be run on a fraction of microseconds. Usually, the number of instructions is limited and they are all run fast on computers. What is important are instructions or loops whose execution depends on a variable, which could be the size of the input to the program.

SonarQube: Qualify Cognitive Complexity

I understand what cognitive complexity is and how it will be calculated, but i don't now how to determine what is a good value for that measure, so how complex my code schouldn't be. I need an objective way to estimate it without to compare project against each other. A kind of formula like "complexity/lines code" or something like that.
Or if i define a quality gate for a big project how can i calculate the values for it.
At a method level, 15 is a recommended maximum.
At the class level, it depends on what you expect in the package.
For instance, in a package that should only hold classes with fields and simple getters or setters, a class with a Cognitive Complexity over 0 (5? 10?) probably deserves another look.
On the other hand, in a package that holds business logic classes, a class score >= ... 150(?) might indicate that it's time to look at splitting the class.
In terms of what the limit should be for a project, that's unanswerable, and brings us back to Fred Brooks' essential vs accidental complexity. Basically, there's a certain amount of logic that's required to get the job done. Complexity beyond that is accidental and could theoretically be eliminated. Figuring out the difference between the two is the crux of the issue, and in looking for the accidental complexity I would concentrate on methods where the complexity crosses the default threshold of 15.
To answer your initial question, "What should the limit for an application be?", I would say there shouldn't be one. Because the essential complexity for a simple calculator app is far, far lower than for a program on the Space Shuttle. And if you try to make the Space Shuttle program fit inside the calculator threshold, you're absolutely going to break something.
(Disclosure: I'm the primary author of Cognitive Complexity)

Significance of Asymptotic Time complexity while Programming/Coding?

I have heard a lot about time complexity.Time complexity itself is an Aproximation,since we are concerned about Worstcase(Big-Oh),Bestcase(Big-Omega) and Average(Theta).
Each programming language includes lots of built-in-functions.I really dont know if there is a way to check th time complexity of these functions.
Since we are using buit-in-functions,
Do we really need to Consider the Time complexity while coding?
What about the space complexity?
Is there any way to check the time complexity of these functions.
Since we are using buit-in-functions?
Do we really need to Consider the Time complexity while coding?
If your application needs to be able to scale up to larger problems, then yes. Otherwise no.
What about the space complexity?
Same answer.
Is there any way to check the time complexity of these functions. Since we are using built-in-functions?
Read the documentation. The complexity of methods for standard classes is often documented.
Use your knowledge of algorithmics. For example, you should have been taught in your algorithmic unit in your CS course that sorting is O(NlogN) for decent sort algorithms, or that finding an element in a list is O(N) on average. (If you didn't take an algorithmics uint, there are lots of good textbooks ...)
Inspect, and if necessary analyze, the source code of the builtin functions.
(Note: I do not recommend the "empirical" approach of estimating complexity. It can give you the wrong answer ... even ignoring the standard issues with measurement methodology.)
Is there any way to check the time complexity of these functions. Since we are using built-in-functions?
Yes.
The most common way is to run it in a loop (10,000 - 10,000,000 times, depending on the function, software, precision of timers etc.) with timers (timestamp before and after, stopwatch, etc.) and then compare it to your other options.
EDIT:
You need to use many different input sizes, and note the potentially confusing effect of caching on measurements.

Initial Genetic Programming Parameters

I did a little GP (note:very little) work in college and have been playing around with it recently. My question is in regards to the intial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work
better when you have a noisy fitness
function, I think this is because the growth
of sub-groups in the population over successive generations acts
to give more sampling of
the fitness function. I typically use
100 for less noisy/deterministic functions, 1000+
for noisy.
For number of generations it is best to measure improvements in the
fitness function and stop when it
meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out, if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an
implementation without explicit
limits but penalise or eliminate
programs that run too long (which is probably
what you really care about....). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation / recombination operators don't seem
to matter all that much. As long as
you have a comprehensive set of mutations, any reasonably balanced
distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
Any problem in computer science can be
solved with another layer of
indirection (except for too many
layers of indirection.)
-David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data variating parameters on a very simple problem and link given operators and parameters values (such as mutation rates, etc) to given results in function of population size etc.
Once I started getting into GA a bit more I then realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
talking from my (limited) experience, if you decide to simplify the problem and use a fixed way to implement crossover, selection, and just play with population size and mutation rate (implemented in a given way) trying to come up with general results you'll soon realize that too many variables are still into play because at the end of the day the number of generations after which statistically you will get a decent result (whatever way you wanna define decent) still obviously depend primarily on the problem you're solving and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of effect of given GA parameters!).
It is certainly possible to draft a set of guidelines - as the (rare but good) literature proves - but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in the exact same way and the fitness is evaluated in a somehow an equivalent way (which more often than not means you're ealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient whereas Koza and others often don't deem if worthy to start a GP run with less than a million individuals in the GP population ;-)
As mentioned before it depends on your personal taste and experiences, resources and probably the GP system used!
Cheers,
Jan

What can be parameters other than time and space while analyzing certain algorithms?

I was interested to know about parameters other than space and time during analysing the effectiveness of an algorithms. For example, we can focus on the effective trap function while developing encryption algorithms. What other things can you think of ?
First and foremost there's correctness. Make sure your algorithm always works, no matter what the input. Even for input that the algorithm is not designed to handle, you should print an error mesage, not crash the entire application. If you use greedy algorithms, make sure they truly work in every case, not just a few cases you tried by hand.
Then there's practical efficiency. An O(N2) algorithm can be a lot faster than an O(N) algorithm in practice. Do actual tests and don't rely on theoretical results too much.
Then there's ease of implementation. You usually don't need the best intro sort implementation to sort an array of 100 integers once, so don't bother.
Look for worst cases in your algorithms and if possible, try to avoid them. If you have a generally fast algorithm but with a very bad worst case, consider detecting that worst case and solving it using another algorithm that is generally slower but better for that single case.
Consider space and time tradeoffs. If you can afford the memory in order to get better speeds, there's probably no reason not to do it, especially if you really need the speed. If you can't afford the memory but can afford to be slower, do that.
If you can, use existing libraries. Don't roll your own multiprecision library if you can use GMP for example. For C++, stuff like boost and even the STL containers and algorithms have been worked on for years by an army of people and are most likely better than you can do alone.
Stability (sorting) - Does the algorithm maintain the relative order of equal elements?
Numeric Stability - Is the algorithm prone to error when very large or small real numbers are used?
Correctness - Does the algorithm always give the correct answer? If not, what is the margin of error?
Generality - Does the algorithm work in many situation (e.g. with many different data types)?
Compactness - Is the program for the algorithm concise?
Parallelizability - How well does performance scale when the number of concurrent threads of execution are increased?
Cache Awareness - Is the algorithm designed to maximize use of the computer's cache?
Cache Obliviousness - Is the algorithm tuned for particulary cache-sizes / cache-line-sizes or does it perform well regardless of the parameters of the cache?
Complexity. 2 algorithms being the same in all other respects, the one that's much simpler is going to be a much better candidate for future customization and use.
Ease of parallelization. Depending on your use case, it might not make any difference or, on the other hand, make the algorithm useless because it can't use 10000 cores.
Stability - some algorithms may "blow up" with certain test conditions, e.g. take an inordinately long time to execute, or use an inordinately large amount of memory, or perhaps not even terminate.
For algorithms that perform floating point operations, the accumulation of round-off error is often a consideration.
Power consumption, for embedded algorithms (think smartcards).
One important parameter that is frequently measure in the analysis of algorithms is that of Cache hits and cache misses. While this is a very implementation and architecture dependent issue, it is possible to generalise somewhat. One particularly interesting property of the algorithm is being Cache-oblivious, which means that the algorithm will use the cache optimally on multiple machines with different cache sizes and structures without modification.
Time and space are the big ones, and they seem so plain and definitive, whereby they should often be qualified (1). The fact that the OP uses the word "parameter" rather than say "criteria" or "properties" is somewhat indicative of this (as if a big O value on time and on space was sufficient to frame the underlying algorithm).
Other criteria include:
domain of applicability
complexity
mathematical tractability
definitiveness of outcome
ease of tuning (may be tied to "complexity" and "tactability" afore mentioned)
ability of running the algorithm in a parallel fashion
(1) "qualified": As hinted in other answers, a -technically- O(n^2) algorithm may be found to be faster than say an O(n) algorithm, in 90% of the cases (which, btw, may turn out to be 100% of the practical cases)
worst case and best case are also interesting, especially when linked to some conditions in the input. if your input data shows some properties, an algorithm, by taking advantage of this property, may perform better that another algorithm which performs the same task but does not use that property.
for example, many sorting algorithm perform very efficiently when input are partially ordered in a specific way which minimizes the number of operations the algorithm has to execute.
(if your input is mostly sorted, an insertion sort will fit nicely, while you would never use that algorithm otherwise)
If we're talking about algorithms in general, then (in the real world) you might have to think about CPU/filesystem(read/write operations)/bandwidth usage.
True they are way down there in the list of things you need worry about these days, but given a massive enough volume of data and cheap enough infrastructure you might have to tweak your code to ease up on one or the other.
What you are interested aren’t parameters, rather they are intrinsic properties of an algorithm.
Anyway, another property you might be interested in, and analyse an algorithm for, concerns heuristics (or rather, approximation algorithms), i.e. algorithms which don’t find an exact solution but rather one that is (hopefully) good enough.
You can analyze how far a solution is from the theoretical optimal solution in the worst case. For example, an existing algorithm (forgot which one) approximates the optimal travelling salesman tour by a factor of two, i.e. in the worst case it’s twice as long as the optimal tour.
Another metric concerns randomized algorithms where randomization is used to prevent unwanted worst-case behaviours. One example is randomized quicksort; quicksort has a worst-case running time of O(n2) which we want to avoid. By shuffling the array beforehand we can avoid the worst-case (i.e. an already sorted array) with a very high probability. Just how high this probability is can be important to know; this is another intrinsic property of the algorithm that can be analyzed using stochastic.
For numeric algorithms, there's also the property of continuity: that is, whether if you change input slightly, output also changes only slightly. See also Continuity analysis of programs on Lambda The Ultimate for a discussion and a link to an academical paper.
For lazy languages, there's also strictness: f is called strict if f _|_ = _|_ (where _|_ denotes the bottom (in the sense of domain theory), a computation that can't produce a result due to non-termination, errors etc.), otherwise it is non-strict. For example, the function \x -> 5 is non-strict, because (\x -> 5) _|_ = 5, whereas \x -> x + 1 is strict.
Another property is determinicity: whether the result of the algorithm (or its other properties, such as running time or space consumption) depends solely on its input.
All these things in the other answers about the quality of various algorithms are important and should be considered.
But time and space are two things that vary at some rate compared to the size of the input (n). So what else can vary according to n?
There are several that are related to I/O. For example, the number of writes to a disk is an important one, which may not be directly shown by space and time estimates alone. This becomes particularly important with flash memory, where the number of writes to the same memory location is the significant metric in some algorithms.
Another I/O metric would be "chattiness". A networking protocol might send shorter messages more often adding up to the same space and time as another networking protocol, but some aspect of the system (perhaps billing?) might make minimizing either the size or number of the messages desireable.
And that brings us to Cost, which is a very important algorithmic consideration sometimes. The cost of an algorithm may be affected by both space and time in different amounts (consider the separate costing of server storage space and gigabits of data transfer), but the cost is the thing that you wish to minimize overall, so it may have its own big-O estimations.

Resources