I am implementing the DES encryption algorithm in C++ and benchmarking it on a large (1.1 MB) plaintext document.
Encryption now takes about 1.1 seconds, and I need to squeeze more performance out of it.
I was thinking of obfuscation. Will that help optimize my code?
I think optimizing your code is the best way to optimize it:
Fix redundant code
Rethink the logic
Remove unused or trivial variables
Store commonly used values in variables to reduce redundant computation
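As a hedged illustration of the last point (storing commonly used values), here is a minimal sketch that builds a lookup table once and reuses it, rather than recomputing the same value for every byte. The bit-reversal operation and the names `reverse_bits_slow`, `make_reverse_table`, and `reverse_all` are made up for this example; they are not part of DES or of the asker's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reverse the bit order of one byte with a loop (8 iterations per call).
std::uint8_t reverse_bits_slow(std::uint8_t b) {
    std::uint8_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = static_cast<std::uint8_t>((r << 1) | (b & 1));
        b = static_cast<std::uint8_t>(b >> 1);
    }
    return r;
}

// Compute all 256 results once; later lookups are a single array read.
std::array<std::uint8_t, 256> make_reverse_table() {
    std::array<std::uint8_t, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = reverse_bits_slow(static_cast<std::uint8_t>(i));
    }
    return table;
}

// Hypothetical hot loop: uses the cached table instead of the per-byte loop.
void reverse_all(std::vector<std::uint8_t>& data) {
    static const std::array<std::uint8_t, 256> table = make_reverse_table();
    for (std::uint8_t& b : data) {
        b = table[b];
    }
}
```

Whether a table like this actually beats the loop on a given machine is something only a profiler can tell you.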
Obfuscation makes code harder to read by:
Replacing variable names with underscores or single letters (compilers don't use variable names)
Removing whitespace to create a neutron star of unreadable text (compilers do this internally)
Removing comments (compilers don't read comments)
Sometimes adding useless code to further hinder readability (making your program run slower)
Well, you did not write what kind of obfuscation you have in mind (on a source code level?), but generally: no, it won't. In a language like JavaScript (or very old interpreted BASIC dialects), obfuscation and optimization sometimes go hand in hand (shortening variable names, deleting unnecessary whitespace/indentation, etc.), but not in a compiled language like C++.
Of course, sometimes some kind of misguided optimization will lead to obfuscated code, but that is a different thing.
C++ compilers nowadays are really REALLY smart. Major optimizations come at a macroscopic level. Even Blender's example, removing unused variables, is not needed, since the optimizer will remove them anyway.
Obfuscation doesn't make your code smarter, it doesn't change algorithms, it doesn't introduce dynamic programming, or anything of the sort.
I don't see why you would want that, though. With compiled languages, you don't have to ship the source code. If needed, you can ship headers and libraries, but those don't give away implementation details.
I'm biased towards writing fool-proof applications. For example, with a PHP site, I validate all the inputs on the client side using JS. On the server side I validate again. On both sides I check for emptiness and for other patterns (email, phone, URL, number, etc.). Then I strip malicious tags or characters and trim the input (server-side). Later I convert the input into the desired formats/data types (string, int, float, etc.). If the library is meant for server-side use only, I even give developers a chance at graceful degradation: I tolerate the worst inputs and normalize them to acceptable ones (I have a predefined set of acceptable ones).
Now I'm reading a library that I wrote one and a half years ago. I'm wondering whether developers are really so evil or so lacking in IQ that I need to do this much graceful degradation, looking for every possible chance to make them right even when they give crappy input, all of which seriously harms performance. Or should I do minimal checking and expect developers to be able and willing to give proper input? I have no hope for end users, but should I trust developers more and give them an application/library with better performance?
Common policy is to validate on the server anything sent from the client because you can't be totally sure it really was your client that sent it. You don't want to "trust developers more" and in the process find that you've "trusted hackers of your site more".
Fixing invalid input automatically can be as much a curse as a blessing -- you've essentially committed to accepting the invalid input as a valid part of your protocol (i.e., if a future version makes a change that breaks the invalid input you were correcting, it is no longer backwards compatible with the client code that has been written). In extremis, you might paint yourself into a corner that way. Also, invalid calls tend to propagate to new code -- people often copy-and-paste example code and then modify it to meet their needs. If they've copied bad code that you've been correcting at the server, you might find you start getting proportionally more and more bad data coming in, as well as confusing new programmers who think "that just doesn't look like it should be right, but it's the example everyone is using -- maybe I don't understand this after all".
Never expect diligence from developers. Always validate, if you can, any input that comes into your code, especially if it comes across a network.
End users (whether they're programmers using your tool, or non-programmers using your application) don't have to be stupid or evil to type the wrong thing in. As programmers we all too often make wrong assumptions about what's obvious for them.
That's the first thing, which justifies comprehensive validation all on its own. But validation isn't the same as guessing what they meant from what they typed, and inferring correct input from incorrect - unless the inference rules are also well known to the users (like Word's auto-correct, for instance).
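To illustrate the distinction between validating and guessing, here is a small sketch of a strict parser that either accepts the input exactly or rejects it, and never tries to repair it. The function name `parse_port` and the port-number use case are assumptions chosen for the example, not something from the question.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Strict validation: the whole string must be a decimal number in 1..65535.
// Anything else is rejected; nothing is trimmed, clamped, or "fixed up".
std::optional<int> parse_port(std::string_view text) {
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;  // not a number, overflow, or trailing junk
    }
    if (value < 1 || value > 65535) {
        return std::nullopt;  // out of range
    }
    return value;
}
```

A caller that gets an empty result can report the error to the user, which keeps the "what did they mean?" decision out of the library.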
But what is this performance you seek? There's no bit of client-side (or server-side, for that matter) validation that takes longer to run than the second or so that is an acceptable response time.
Validate, and ensure it doesn't break as the first priority. Then worry about making it clever enough to know (reliably) what they meant. After that, worry about how fast it is. In the real world, syntax validation doesn't make a measurable difference to anything where user input takes most of the total time.
Microsoft made the mistake of trusting programmers to do the right thing back in the days of Windows 3.1 and to a lesser extent Windows 95. You need only read a few posts from Raymond Chen to see where that road ultimately leads.
(P.S. This is not a dig against Microsoft; it's a statement of fact about how programmers abused the more liberal Win16, either deliberately or through ignorance.)
I think you are right in being biased toward fool-proof applications. I would not assume that that degrades performance enough to be of much concern. Rather I would address performance concerns separately, starting by profiling or my favorite method, stackshots. There must be a way to get those in PHP.
Inspired by this question which started out innocently but is turning into a major flame war.
Let's say you need a utility method - reasonably straightforward but not a one-liner. The quoted question was how to repeat a string X times. How do you decide whether to use a 3rd party implementation or write your own?
The obvious downside to 3rd party approach is you're adding a dependency to your code.
But if you're writing your own you need to code it, test it, (maybe) profile it so you'll likely end up spending more time.
I know the decision itself is subjective, but the criteria you use to arrive at it should not be.
So, what criteria do you use to decide when to write your own code?
General Decision
Before deciding on what to use, I will create a list of criteria that must be met by the library. This could include size, simplicity, integration points, speed, problem complexity, dependencies, external constraints, and license. Depending on the situation the factors involved in making the decision will differ.
Generally, I will hunt for a suitable library that solves the problem before writing my own implementation. If I have to write my own, I will read up on appropriate algorithms and seek ideas from other implementations (e.g., in a different language).
If, after all the aspects described below, I can find no suitable library or source code, and I have searched (and asked on suitable forums), then I will develop my own implementation.
Complexity
If the task is relatively simple (e.g., a MultiValueMap class), then:
Find an existing open-source implementation.
Integrate the code.
Rewrite it, or trim it down, if it is excessive.
If the task is complex (e.g., a flexible object-oriented graphing library), then:
Find an open-source implementation that compiles (out-of-the-box).
Execute its "Hello, world!" equivalent.
Perform any other evaluations as required.
Determine its suitability based on the problem domain criteria.
Speed
If the library is too slow, then:
Profile it.
Optimize it.
Contribute the results back to the community.
If the code is too complex to be optimized, and speed is a factor, discuss it with the community and provide profiling details. Otherwise, look for an equivalent, but faster (possibly less feature-rich) library.
API
If the API is not simple, then:
Write a facade and contribute it back to the community.
Or find a simpler API.
Size
If the compiled library is too large, then:
Compile only the necessary source files.
Or find a smaller library.
Bugs
If the library does not compile out of the box, seek alternatives.
Dependencies
If the library depends on scores of external libraries, seek alternatives.
Documentation
If there is insufficient documentation (e.g., user manuals, installation guides, examples, source code comments), seek alternatives.
Time Constraints
If there is ample time to find an optimal solution, then do so. Often there is not sufficient time to write from scratch. And usually there are a number of similar libraries to evaluate. Keep in mind that, by meticulous loose coupling, you can always swap one library for another. Find what works, initially, and if it later becomes a burden, replace it.
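The loose-coupling point can be sketched with a thin facade: callers use one small function, and only its body knows whether the work is done in-house or by a library. The namespace `textutil` and the function `repeat` are hypothetical names for this example, picked because the quoted question was about repeating a string. The body shown is a hand-written implementation; swapping in a third-party call would change only that body.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace textutil {

// Facade: the only function callers see.
std::string repeat(std::string_view s, std::size_t times) {
    std::string out;
    out.reserve(s.size() * times);  // one allocation up front
    for (std::size_t i = 0; i < times; ++i) {
        out.append(s);
    }
    return out;
}

}  // namespace textutil
```

Because the rest of the code only ever calls `textutil::repeat`, replacing the implementation later does not ripple through the codebase.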
Development Environment
If the library is tied to a specific development environment, seek alternatives.
License
Open source.
10 questions ...
+++ (use library) ... --- (write own library)
Is the library exactly what I need? Customizable in a few steps? +++
Does it provide almost all functionality? Easily extensible? +++
No time? +++
It's good for one half and plays well with others? ++
Hard to extend, but excellent documentation? ++
Hard to extend, yet most of the functionality? +
Functionality ok, but outdated? -
Functionality ok, .. but weird (crazy interface, not robust, ...)? --
Library works, but the person who needs to decide is in a state of hubris? ---
Library works, manageable code size, portfolio needs update? ---
Some thoughts ...
If it is something that is small but useful, probably for others too, then why not write a library and put it on the web? The cost of publishing this kind of small library has decreased, as has the hurdle for others to join in (see Bitbucket or GitHub). So what are the criteria?
Maybe it should not exactly replicate an existing already known library. If it replicates something existing, it should approach the problem from new angle, or better it should provide a shorter or more condensed* solution.
*/fun
If it's a trivial function, it's not worth pulling in an entire library.
If it's a non-trivial function, then it may be worth it.
If it's multiple functions which can all be handled by pulling in a single library, it's almost definitely worth it.
Keep it in balance
You should keep several criteria in balance. I'd consider a few topics and ask a few questions.
Developing time VS maintenance time
Can I develop what I need in a few hours? If yes, why do I need a library? If I get a lib, am I sure that it will not cost me hours of debugging and documentation reading? The answer: if I need something obvious and straightforward, I don't need an extra-flexible lib.
Simplicity VS flexibility
If I need just an error wrapper do I need a lib with flexible types and stack tracing and color prints and.... Nope! Using even beautifully designed but flexible and multipurpose libs could slow your code. If you plan to use 2% of functionality you don't need it.
Big task VS small task
Am I facing a huge task that needs external code to solve? AMQP or SQL operations are definitely too big to develop from scratch, but tiny logging could be solved in place. Don't use external libs to solve small tasks.
My own libs VS external libs
Sometimes it is better to grow your own library, because it is 100% used, 100% suited to your goals, you know it best, and it is always up to date with your applications. Don't build your own lib just to be cool, and keep in mind that a lot of the libs in your vendor directory were developed "just to be cool".
For me this would be a fairly easy answer.
If you need to be cost effective, then it would probably be best to try and find a library/framework that does what you want. If you can't find it, then you will be forced to write it or find a different approach.
If you have the time and find it fun, write one. You will learn a lot along the way and you can give back to the open source community with your killer new bundle of code. If you don't, well, then don't. But if you can't find one, then you have to write it anyway ;)
Personally, if I can justify writing a library, I always opt for that. It's fun, you learn a lot about what you are directing your focus towards, and you have another tool to add to your arsenal and put on your CV.
If the functionality is only a small part of the app, or if your needs are the same as everyone else's, then a library is probably the way to go. If you need to consume and output JSON, for example, you can probably knock something together in five minutes to handle your immediate needs. But then you start adding to it, bit by bit. Eventually, you have all the functionality that you would find in any library, but 1) you had to write it yourself and 2) it isn't as robust and well documented as what you would find in a library.
If the functionality is a big part of the app, and if your needs aren't exactly the same as everyone else's, then think much more carefully. For example, if you are doing machine learning, you might consider using a package like Weka or Mahout, but these are two very different beasts, and this component is likely to be a significant part of your application. A library in this case could be a hindrance, because your needs might not fit the design parameters of the original authors, and if you attempt to modify it, you will need to worry about a much larger and more complex system than the minimum that you would build yourself.
There's a good article out there talking about sanitizing HTML, and how it was a big part of the app, and something that would need to be heavily tuned, so using an outside library wasn't the best solution, in spite of the fact that there were many libraries out that did exactly what seemed to be called for.
Another consideration is security.
If a black-hat hacker finds a bug in your code, they can create an exploit and sell it for money. The more popular the library is, the more the exploit is worth. Think about OpenSSL or WordPress exploits. If you re-implement the code, chances are that your code is not vulnerable in exactly the same way the popular library is. And if your lib is not popular, then a zero-day exploit of your code probably wouldn't be worth much, and there is a good chance your code is not targeted by bounty hunters.
Another consideration is language safety. C can be very fast, but from a security standpoint it's asking for trouble. If you reimplement the lib in a scripting language, the chances of arbitrary code execution exploits are low (as long as you know the possible attack vectors, like serialization or evals).
Following my previous question regarding the rationale behind extremely long functions, I would like to present a specific question regarding a piece of code I am studying for my research. It's a function from the Linux kernel which is quite long (412 lines) and complicated (an MCC index of 133). Basically, it's a long and nested switch statement.
Frankly, I can't think of any way to improve this mess. A dispatch table seems both huge and inefficient, and any subroutine call would require an inconceivable number of arguments in order to cover a large-enough segment of code.
Can you think of any way this function can be rewritten in a more readable way, without losing efficiency? If not, does the code seem readable to you?
Needless to say, any answer that will appear in my research will be given full credit - both here and in the submitted paper.
Link to the function in an online source browser
I don't think that function is a mess. I've had to write such a mess before.
That function is the translation into code of a table from a microprocessor manufacturer. It's very low-level stuff, copying the appropriate hardware registers for the particular interrupt or error reason. In this kind of code, you often can't touch registers which have not been filled in by the hardware - that can cause bus errors. This prevents the use of code that is more general (like copying all registers).
I did see what appeared to be some code duplication. However, at this level (operating at interrupt level), speed is more important. I wouldn't use Extract Method on the common code unless I knew that the extracted method would be inlined.
BTW, while you're in there (the kernel), be sure to capture the change history of this code. I have a suspicion that you'll find there have not been very many changes in here, since it's tied to hardware. The nature of the changes over time of this sort of code will be quite different from the nature of the changes experienced by most user-mode code.
This is the sort of thing that will change, for instance, when a new consolidated IO chip is implemented. In that case, the change is likely to be copy and paste and change the new copy, rather than to modify the existing code to accommodate the changed registers.
Utterly horrible, IMHO. The obvious first-order fix is to make each case in the switch a call to a function. And before anyone starts mumbling about efficiency, let me just say one word - "inlining".
Edit: Is this code part of the Linux FPU emulator by any chance? If so, this is very old code that was a hack to get Linux to work on Intel chips like the 386, which didn't have an FPU. If it is, it's probably not a suitable study for academics, except for historians!
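A minimal sketch of the "one function per case" idea follows. The operation names and the handler bodies are invented for illustration and bear no relation to the actual kernel function. Whether a compiler really inlines the handlers depends on the compiler and its settings, so that would need to be checked in the generated code.

```cpp
#include <cassert>

// Hypothetical operation codes; the real function switches on hardware-defined values.
enum class Op { Add, Sub, Negate };

// Each former case body becomes a small named function.
// `inline` is at most a hint; optimizers decide for themselves.
inline int handle_add(int a, int b) { return a + b; }
inline int handle_sub(int a, int b) { return a - b; }
inline int handle_negate(int a) { return -a; }

// The switch is reduced to a readable dispatch.
int dispatch(Op op, int a, int b) {
    switch (op) {
        case Op::Add:    return handle_add(a, b);
        case Op::Sub:    return handle_sub(a, b);
        case Op::Negate: return handle_negate(a);
    }
    return 0;  // unreachable for valid Op values
}
```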
There's a kind of regularity here; I suspect that for a domain expert this actually feels very coherent.
Also, having the variations in close proximity allows immediate visual inspection.
I don't see a need to refactor this code.
I'd start by defining constants for the various classes. Coming into this code cold, it's a mystery what the switching is for; if the switching was against named constants, I'd have a starting point.
Update: You can get rid of about 70 lines where the cases return MAJOR_0C_EXCP; simply let them fall through to the end of the routine. Since this is kernel code I'll mention that there might be some performance issues with that, particularly if the case order has already been optimized, but it would at least reduce the amount of code you need to deal with.
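The two suggestions in this answer, named constants and letting the common cases share one return, might look like the sketch below. All names and values (`InsnClass`, `kNoException`, `kUnimplemented`, `classify`) are placeholders made up for the example; they are not taken from the kernel source.

```cpp
#include <cassert>

// Named constants instead of bare numbers in the case labels.
enum InsnClass { kClassLoad = 0, kClassStore = 1, kClassArith = 2, kClassReserved = 3 };

constexpr int kNoException = 0;
constexpr int kUnimplemented = 1;  // stands in for a shared error code

int classify(int insn_class) {
    switch (insn_class) {
        case kClassLoad:
        case kClassStore:
        case kClassArith:
            return kNoException;
        default:
            break;  // every other case falls through to the single return below
    }
    return kUnimplemented;
}
```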
I don't know much about kernels or about how re-factoring them might work.
The main thing that comes to my mind is taking that switch statement and breaking each sub-step into a separate function with a name that describes what the section is doing. Basically, more descriptive names.
But I don't think this optimizes the function any more. It just breaks it into smaller functions, which might be helpful... I don't know.
That is my 2 cents.
What's best practice for reuse of code versus copy/paste?
The problem with reuse can be that changing the reused code will affect many other pieces of functionality.
This is both good and bad: good if the change is a bugfix or useful enhancement, bad if other reusing code unexpectedly becomes broken because it relied on the old version (or the new version has a bug).
In some cases it would seem that copy/paste is better - each user of the pasted code has a private copy which it can customize without consequences.
Is there a best practice for this problem; does reuse require watertight unit tests?
Every line of code has a cost.
Studies show that the cost is not linear with the number of lines of code; it's exponential.
Copy/paste programming is the most expensive way to reuse software.
"does reuse require watertight unit tests?"
No.
All code requires adequate unit tests. All code is a candidate for reuse.
It seems to me that a piece of code that is used in multiple places that has the potential to change for one place and not for another place isn't following proper rules of scope. If the "same" method/class is needed by two different things to do two different functions, then that method/class should be split up.
Don't copy/paste. If it does turn out that you need to modify the code for one place, then you can extend it, possibly through inheritance, overloading, or if you must, copying and pasting. But don't start out by copy-pasting similar segments.
Using copy and paste is almost always a bad idea. As you said, you can have tests to check in case you break something.
The point is, when you call a method, you shouldn't really care about how it works, but about what it does. If you change the method, changing what it does, then it should be a new method, or you should check wherever this method is called.
On the other side, if the change doesn't modify WHAT the method does (only how), then you shouldn't have a problem elsewhere. If you do, you've done something wrong...
One very appropriate use of copy and paste is Triangulation. Write code for one case, see a second application that has some variation, copy & paste into the new context - but you're not done. It's if you stop at that point that you get into trouble. Having this code duplicated, perhaps with minor variation, exposes some common functionality that your code needs. Once it's in both places, tested, and working in both places, you should extract that commonality into a single place, call it from the two original places, and (of course) re-test.
If you have concerns that code which is called from multiple places is introducing risk of fragility, your functions are probably not fine-grained enough. Excessively coarse-grained functions, functions that do too much, are hard to reuse, hard to name, hard to debug. Find the atomic bits of functionality, name them, and reuse them.
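As a rough illustration of the last step of triangulation, the sketch below shows two callers that once held near-identical pasted code and now share one extracted helper. The names (`format_amount`, `invoice_line`, `receipt_line`) and the formatting rule are hypothetical.

```cpp
#include <cassert>
#include <string>

// Extracted commonality: previously pasted into both callers.
// Formats a non-negative amount in cents as "units.cc".
std::string format_amount(long cents) {
    std::string frac = std::to_string(cents % 100);
    if (frac.size() < 2) {
        frac = "0" + frac;  // pad to two digits
    }
    return std::to_string(cents / 100) + "." + frac;
}

// First original context, now calling the shared helper.
std::string invoice_line(const std::string& item, long cents) {
    return item + ": " + format_amount(cents);
}

// Second original context, with its own small variation around the helper.
std::string receipt_line(long cents) {
    return "TOTAL " + format_amount(cents);
}
```

A fix to the padding rule now happens in one place, and both callers are re-tested against it.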
So the consumer (reuser) code is dependent on the reused code, that's right.
You have to manage this dependency.
It is true for binary reuse (eg. a dll) and code reuse (eg. a script library) as well.
Consumer should depend on a certain (known) version of the reused code/binary.
Consumer should keep a copy of the reused code/binary, but never directly modify it, only update to a newer version when it is safe.
Think carefully when you modify the reused codebase. Branch for breaking changes.
If a Consumer wants to update the reused code/binary, then it first has to test to see if it's safe. If tests fail, the Consumer can always fall back to the last known (and kept) good version.
So you can benefit from reuse (eg. you have to fix a bug in one place), and still you're in control of changes. But nothing saves you from testing whenever you update the reused code/binary.
"Is there a best practice for this problem; does reuse require watertight unit tests?"
Yes, and sort of yes. Rewriting code you already got right once is never a good idea. If you never reuse code and just rewrite it, you are doubling your bug surface. As with many best-practice questions, Code Complete changed the way I do my work. Yes, unit test to the best of your ability; yes, reuse code; and get a copy of Code Complete, and you will be all set.
Copy and pasting is never good practice. Sometimes it might seem better as a short-term fix in a pretty poor codebase, but in a well designed codebase you will have the following affording easy re-use:
encapsulation
well defined interfaces
loose-coupling between objects (few dependencies)
If your codebase exhibits these properties, copy and pasting will never look like the better option. And as S Lott says, there is a huge cost to unnecessarily increasing the size of your codebase.
Copy/paste leads to divergent functionality. The code may start out the same, but over time, changes in one copy don't get reflected in all the other copies where they should.
Also, copy/paste may seem "OK" in very simple cases but it also starts putting programmers into a mindset where copy/paste is fine. That's the "slippery slope". Programmers start using copy/paste when refactoring should be the right approach. You always have to be careful about setting precedent and what signals that sends to future developers.
There's even a quote about this from someone with more experience than I,
"If you use copy and paste while you're coding, you're probably committing a design error." -- David Parnas
You should be writing unit tests. And yes, having cloned code can in some sense give you a feeling of security that your change isn't affecting a large number of other routines, but it is probably a false sense of security. Basically, it comes from not knowing how the code is used. (Ignorance here isn't a pejorative; it is just a result of not being able to know everything about the codebase.) Get used to using your IDE to learn where the code is being used, and get used to reading code to know how it is being used.
Where you write:
"The problem with reuse can be that changing the reused code will affect many other pieces of functionality. ... In some cases it would seem that copy/paste is better - each user of the pasted code has a private copy which it can customize without consequences."
I think you've reversed the concerns related to copy-paste. If you copy code to 10 places and then need to make a slight modification to behavior, will you remember to change it in all 10 places?
I've worked on an unfortunately large number of big, sloppy codebases and generally what you'll see is the results of this - 20 versions of the same 4 lines of code. Some (usually small) subset of them have 1 minor change, some other small (and only partially intersecting subset) have some other minor change, not because the variations are correct but because the code was copied and pasted 20 times and changes were applied almost, but not quite consistently.
When it gets to that point it's nearly impossible to tell which of those variations are there for a reason and which are there because of a mistake (and since it's more often a mistake of omission - forgetting to apply a patch rather than altering something - there's not likely to be any evidence or comments).
If you need different functionality call a different function. If you need the same functionality, please avoid copy paste for the sanity of those who will follow you.
There are metrics that can be used to measure your code, and it's up to you (or your development team) to decide on an adequate threshold. Ruby on Rails has the "Metric-Fu" gem, which incorporates many tools that can help you refactor your code and keep it in tip-top shape.
I'm not sure what tools are available for other languages, but I believe there is one for .NET.
In general, copy and paste is a bad idea. However, like any rule, this has exceptions. Since the exceptions are less well-known than the rule I'll highlight what IMHO are some important ones:
You have a very simple design for something that you do not want to make more complicated with design patterns and OO stuff. You have two or three cases that vary in about a zillion subtle ways, i.e. a line here, a line there. You know from the nature of the problem that you won't likely ever have more than 2 or 3 cases. Sometimes it can be the lesser of two evils to just cut and paste than to engineer the hell out of the thing to solve a relatively simple problem like this. Code volume has its costs, but so does conceptual complexity.
You have some code that's very similar for now, but the project is rapidly evolving and you anticipate that the two instances will diverge significantly over time, to the point where trying to even identify reasonably large, factorable chunks of functionality that will stay common, let alone refactor these into reusable components, would be more trouble than it's worth. This applies when you believe that the probability of a divergent change to one instance is much greater than that of a change to common functionality.