Sorting algorithm for unpredicted data [closed] - algorithm

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Considering the fact that currently many libraries already have optimized sort engines, then why still many companies ask about Big O and some sorting algorithms, when in reality in our every day in computing, this type of implementation is not any longer really needed?
For example, self balancing binary tree, is a kind of problem that some big companies from the embedded industry still ask programmers to code as part of their testing and candidate selection process.
Even for embedded coding, there are any circumstances when such kind of implementation is going to take place, given fact that boost, SQLite and other libraries are available for use? In other words, is it worth still to think on ways to optimize such algorithms?

As an embedded programmer, I would say it all comes down to the problem and system constraints. Especially on a microprocessor, you may not want/need to pull in Boost and SQLite may still be too heavy for a given problem. How you chop up problems looks different if you have say, 2K of RAM - but this is definitely the extreme.
So for example, you probably don't want to rewrite code for a red-black tree yourself because as you pointed out, you will likely end up with highly non-optimized code. But in the pursuit of reusability, abstraction often adds layers of indirection to the code. So you may end up reimplementing at least simpler "solved" problems where you can do better than the built-in library for certain niche cases. Recently I wrote a specialized version of linked lists using shared memory pools for allocation across lists, for example. I had benchmarked against STL's list and it just wasn't cutting it because of the added layers of abstraction. Is my code as good? No, probably not. But I was more easily able to specialize the specific uses cases, so it came out better.
So I guess I'd like to address a few things from your post:
-Why do companies still ask about big-O runtime? I have seen even pretty good engineers make bad choices with regards to data structures because they didn't reason carefully about the O() runtime. Quadratic versus linear or linear versus constant time operation is a painful lesson when you get the assumption wrong. (ah, the voice of experience)
-Why do companies still ask about implementing classic structures/algorithms? Arguably you won't reimplement quick sort, but as stated, you may very well end up implementing slightly less complicated structures on a more regular basis. Truthfully, if I'm looking to hire you, I'm going to want to make sure that you understand the theory inside and out for existing solutions so if I need you to come up with a new solution you can take an educated crack at it. And if the other applicant has that and you don't, I'd probably say they have an advantage.
Also, here's another way to think about it. In software development, often the platform is quite powerful and the consumer already owns the hardware platform. In embedded software development, the consumer is probably purchasing the hardware platform - hopefully from your company. This means that the software is often selling the hardware. So often it makes more cents to use less powerful, cheaper hardware to solve a problem and take a bit more time to develop the firmware. The firmware is a one-time cost upfront, whereas hardware is per-unit. So even from the business side there are pressures for constrained hardware which in turn leads to specialized structure implementation on the software side.

If you suggest using SQLite on a 2 kB Arduino, you might hear a lot of laughter.
Embedded systems are bit special. They often have extraordinarily tight memory requirements, so space complexity might be far more important than time complexity. I've rarely worked in that area myself, so I don't know what embedded systems companies are interested in, but if they're asking such questions, then it's probably because you'll need to be more acquainted with such issues than in other areas of I.T.

Nothing is optimized enough.
Besides, the questions are meant to test your understanding of the solution (and each part of the solution) and not how great you are at memorizing stuff. Hence it makes perfect sense to ask such questions.

Related

Application of dynamic programming in real world programming [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I have found that dynamic programming is a bit skillful and demanding. But since I expect myself to become an adequate software engineer, I am wondering in which development scenario will DP massively be used or in other words are there any practical usage of it in development based on modern computers?
If you think about design patterns like Proxy pattern and dynamic proxy, which is broadly used in spring framework, DP seems like it is only useful in tech interview.
Also, application of parallelized computing and distributed system seems not easy to empower DP in modern computer context.
Are there any not rare scenarios where DP is widely used in very practical ways?
Please forgive my ignorance, since I haven't meet DP in real production level development, which makes me doubt the meaning of digging into DP.
I agree with # Matt Timmermans.
You don't learn about DP in case you have to use DP someday. By practicing DP, you learn ways of thinking about problems that will make you a better developer. In 10 years, nobody will care about the spring framework, but the techniques you learned from DP will still serve you well.
Now, the answer to your questions, part by parts:
1) Why DP if we have modern computers?
I think you got confused by the analogy of modern computers and the need for DP. Although modern computers are powerful in processing, you may think why I need DP if I have modern fast processors to run my application on.
Not every task can be executed on these modern computers as they come up with storage, network, and compute costs. In fact, as an engineer, we should be thinking of optimizing the usage of such resources, that is, making your code efficient to make it capable of running on minimum system configurations.
In today's world, we have a shared service architecture. It means that different independent services share resources. But the fact is they are interdependent indirectly. Imagine what will happen if a non-optimized code is consuming a lot of memory and compute time. These processors will face difficulty in allocating resources for other services or applications.
The thing is, "Why should I buy an apartment if a multi-bedroom flat can meet my needs and also creates an opportunity for others to buy a flat in the same apartment?"
2) DP in tech interviews
The fact that makes DP the most challenging topic to ace is the number of variations in DP.
It checks your ability to break down a difficult task into small ones to avoid reputations and thus save time, efforts, and thus overall resources.
That is one of the most prime reasons why DP is part of tech interviews.
Not only DP teaches you to optimize and learn useful things, but it also highlights bad practices of writing codes.
3) Usage of DP in real life
In Google Maps to find the shortest path between source and the series of destinations (one by one) out of the various available paths.
In networking to transfer data from a sender to various receivers in a sequential manner.
Document Distance Algorithms- to identify the extent of similarity between two text documents used by Search engines like Google, Wikipedia, Quora, and other websites
Edit distance algorithm used in spell checkers.
Databases caching common queries in memory: through dedicated cache tiers storing data to avoid DB access, web servers store common data like configuration that can be used across requests. Then multiple levels of caching in code abstractions within every single request that prevents fetching the same data multiple times and save CPU cycles by avoiding recomputation. Finally, caches within your browser or mobile phones that keep the data that doesn't need to be fetched from the server every time.
Git merge. Document diffing is one of the most prominent uses of LCS.
Dynamic programming is used in TeX's system of calculating the right amounts of hyphenations and justifications.
Genetic algorithms.
Also, I found a great answer on Quora which lists the areas in which DP can be used:
Operations research,
Decision making,
Query optimization,
Water resource engineering,
Economics,
Reservoir Operations problems,
Connected speech recognition,
Slope stability analysis,
Using Matlab,
Using Excel,
Unit commitment,
Image processing,
Optimal Inventory control,
Reservoir operational Problems,
Sap Abap,
Sequence Alignment,
Simulation for sewer management,
Finance,
Production Optimization,
Genetic Algorithms for permutation problem,
Haskell,
HTML,
Healthcare,
Hydropower scheduling,
LISP,
Linear space,
XML indexing and querying,
Business,
Bioinformatics

How do you write data structures that are as efficient as possible in GHC? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
So sometimes I need to write a data structure I can't find on Hackage, or what I find isn't tested or quality enough for me to trust, or it's just something I don't want to be a dependency. I am reading Okasaki's book right now, and it's quite good at explaining how to design asymptotically fast data structures.
However, I am working specifically with GHC. Constant factors are a big deal for my applications. Memory usage is also a big deal for me. So I have questions specifically about GHC.
In particular
How to maximize sharing of nodes
How to reduce memory footprint
How to avoid space leaks due to improper strictness/laziness
How to get GHC to produce tight inner loops for important sections of code
I've looked around various places on the web, and I have a vague idea of how to work with GHC, for example, looking at core output, using UNPACK pragmas, and the like. But I'm not sure I get it.
So I popped open my favorite data structures library, containers, and looked at the Data.Sequence module. I can't say I understand a lot of what they're doing to make Seq fast.
The first thing that catches my eye is the definition of FingerTree a. I assume that's just me being unfamiliar with finger trees though. The second thing that catches my eye is all the SPECIALIZE pragmas. I have no idea what's going on here, and I'm very curious, as these are littered all over the code.
Many functions also have an INLINE pragma associated with them. I can guess what that means, but how do I make a judgement call on when to INLINE functions?
Things get really interesting around line ~475, a section headered as 'Applicative Construction'. They define a newtype wrapper to represent the Identity monad, they write their own copy of the strict state monad, and they have a function defined called applicativeTree which, apparently is specialized to the Identity monad and this increases sharing of the output of the function. I have no idea what's going on here. What sorcery is being used to increase sharing?
Anyway, I'm not sure there's much to learn from Data.Sequence. Are there other 'model programs' I can read to gain wisdom? I'd really like to know how to soup up my data structures when I really need them to go faster. One thing in particular is writing data structures that make fusion easy, and how to go about writing good fusion rules.
That's a big topic! Most has been explained elsewhere, so I won't try to write a book chapter right here. Instead:
Real World Haskell, ch 25, "Performance" - discusses profiling, simple specialization and unpacking, reading Core, and some optimizations.
Johan Tibell is writing a lot on this topic:
Computing the size of a data structure
Memory footprints of common data types
Faster persistent structures through hashing
Reasoning about laziness
And some things from here:
Reading GHC Core
How GHC does optimization
Profiling for performance
Tweaking GC settings
General improvements
More on unpacking
Unboxing and strictness
And some other things:
Intro to specialization of code and data
Code improvement flags
applicativeTree is quite fancy, but mainly in a way which has to do with FingerTrees in particular, which are quite a fancy data structure themselves. We had some discussion of the intricacies over at cstheory. Note that applicativeTree is written to work over any Applicative. It just so happens that when it is specialized to Id then it can share nodes in a manner that it otherwise couldn't. You can work through the specialization yourself by inlining the Id methods and seeing what happens. Note that this specialization is used in only one place -- the O(log n) replicate function. The fact that the more general function specializes neatly to the constant case is a very clever bit of code reuse, but that's really all.
In general, Sequence teaches more about designing persistent data structures than about all the tricks for eeking out performance, I think. Dons' suggestions are of course excellent. I'd also just browse through the source of the really canonical and tuned libs -- Map, IntMap, Set, and IntSet in particular. Along with those, its worth taking a look at Milan's paper on his improvements to containers.

Should developers know discrete math? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Is it important for developers to know discrete math? Most of the books about algorithms and analysis have at least some references to math. I can easily understand the algorithms in principle and can implement them without any problem, but when it comes to the math parts I get stuck. Is it generally assumed that developers will have deep knowledge of math to understand algorithms and methods?
It depends on the kind of developer you're talking about and the kind of math you're talking about. I'm pretty sure that most of the "ordinary" developers don't need to know much about math. But, do you want to be "ordinary" developer?
If you're developing web applications that just display and allow editing of data in database, then you'll probably never need any math.
On the other hand, if you're developing a GPS system that shows a path to the target (or some other application that does more complicated calculations) then discrete math is going to be useful.
It doesn't necessarily be discrete math though - for example, in the finance industry, people need probability and statistics far more often.
That said, knowing math will definitely make you a better developer, because it trains your mind in a way that is useful not only for solving specific (math) problems, but also teaches you how to think about problems in a more formal fashion (which, I believe, is important for writing correct code).
Yes.
I find that discrete math is fairly core to computer science. Understanding set theory, boolean algebra, maps, etc. are all beneficial to a developer and are all part of discrete math.
Of course, the concepts won't always be applicable in the most academic sense. You will almost never open your discrete math textbook and copy something into your code to solve a problem. However, understanding those concepts will help developers write better code, better algorithms, and use design patterns more effectively.
Depends on what the developer is doing probably. If you are doing web, probably not, maybe a bit when it gets to security. Like the time a brute force attack takes to decode a key of certain number of bit under a certain hash. If you are doing graphics for high end games, you probably need to understand quite a bit of maths and the pros and cons of the methods. As a DB admin or network, you shouldn't.
The kinds of problems that you get the chance to solve depends on what you know.
If you only know 4th grade math, you'll only be asked to solve problems that involve math that's at a 4th grade level or less.
If you aspire to do more or understand the underpinnings of other algorithms, you'll have to learn whatever math is necessary.
I think you'll find that working through the points where you get stuck will improve your math, your appreciation for the problems you solve, and a better chance at extending the math you learn to new areas.
It nauseates me to hear people immediately disparage an area that they find difficult, as if to justify their unwillingness to push past the pain of ignorance and struggle. Learning any new skill requires that you push through that barrier, be it math or anything else. I'd advise you to stay with it and prove to yourself that you can master something difficult by resisting the urge to give up.
You have already found out that the results of discrete mathematics are useful in programming. My experience is that understanding why something works, rather than attempting to simply follow it by rote, allows you to find and fix many errors and misunderstandings. It also allows you to handle situations that are almost, but not quite, the way they are in the textbook, and to realise when the textbook answer no longer applies. Time spent understanding even a small part of something that you might use or work on will not be wasted.
If your job is pure CS like Google search where you invent new algorithms then yes, you'll need to be able to analyse running times quite well, also anything sciency like physics simulations.
If you're a 'normal' developer then you'll need to know about running times and what they mean and their impact on your application.
My experience is this:
Knowing something about discrete mathematics is something you will never regret. It will make your job easier in many instances, even in mundane tasks, because you will be familiar enough with various concepts that will at least allow you to construct a smarter google query. In depth familiarity and ability to do these things by rote is probably not helpful to most programmers, but a passing familiarity definitely is.
That said, most programmers in industry that I've encountered (and even some in academia!) know almost nothing about this stuff, so not knowing it is unlikely to place you at a significant disadvantage outside of a few specialist programming sub-disciplines.
You're generally not expected to have a deep knowledge of maths, unless the application requires it - e.g. you're writing financial software, or doing some 3D modelling, load balancing on aircraft, writing some tailored compression algorithm etc. I've worked with great developers who struggled with simple maths. And knowing discreet maths seems very specific. Having an understanding of how a variety of algorithms work can be helpful, and if you can do that it doesn't much matter that you can't construct a proof of their optimality etc.
To be honest what I think is most important is understanding the business you're building for, and how you approach writing code (readability, modular etc)
Some knowledge of discrete math might someday help you stop before spending a lot of time attempting to code up something mathematically impossible or of NP complexity to solve. You will gain a much better "feel" for when some proposed software problem or solution path more resembles an easy homework assignment, or one of those term projects which no one in your class ever completed.
It depends about what part of discrete math you are talking. Of course knowing math will always be an advantage... but I think that knowing of some parts of discrete math are not just advantage, but are very essential for a developer (of course it depends from projects on which he/she is working).
But topics like :
Set theory
Graph theory
Alg. Analysis
Alg. Complexity
Sorting
etc...
are essential for a developer.

How to estimate the contribution of an individual to a software project? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I work on a software project and would like to estimate the percentage out of the total contribution that I have put in the development of the software. Is there some tool doing this? Such a tool can be useful for appraisals or negotiations, for example. After all, we work for money (yes, not only money, put the point remains). I think there is enough hand-waving for the most important things.
The estimation is very subjective (at least to me now) but I do not know of any tool that provides even a subjective estimate. I know of Sloccount that spells out the total effort using the lines of code but not on per-developer basis.
My idea of an ideal tool for this purpose would:
measure the complexity of the code (more complex is more effort, but more effort is not necessarily more contribution)
measure the decomposibility/flexibility of the software (more decomposable is better)
how much library code is used -- using library code speeds up the development process, increases the associated risk and requires the developer to know from before or learn about the library.
be intelligent enough to differentiate between "who wrote the code", "who copied the code" and "who indented the code".
It is difficult to differentiate between the complexity in the implementation and the intrinsic complexity of the problem. Perhaps a comparison can be made with an equivalent open source counterpart if there is, or for each submodule separately.
If there is no such tool, is there no merit in having such a tool? Or do you believe in "I do work, I do not measure"? It takes time after all. Perhaps the project manager should do this estimation continuously, say, weekly. Are there any standards? Yes, standardization is difficult because every project has different goals, but perhaps that should mean there should be multiple standards, not no standards at all. This looks similar to the how a company is valued in the market.
Update: after seeing a few initial answers: It does not make sense to imagine a tool that just outputs the percentages. Are there tools that can help humans (particularly managers) in making better decisions? Or what is the sufficient statistic for making better decisions? Are these statistics available?
I really doubt there is any reliable trustworthy way of measuring individual's contribution to the solution. Sometimes rewriting some complicated legacy code that results in less lines of code, less complicated solution (smaller cyclomatic complexity etc.) can be seen as a quite significant contribution, while in other cases deleting valuable code covering edge cases that results in the same statistics (less lines of code, smaller CC etc.) is definitely something bad. It all comes down to people, trust and cooperation, individualism in the team is almost always wrong and I would rather avoid it and especially not use it as a motivation factor.
This is a research topic on its own. There are several tools that have tried to define metrics like code ownership. There are other approaches which tackle other aspect of collaborative development, for instance the trustability we can have in the code.
There has been also several studies that tried to use the information from bug trackers. For instance, to identify the developer that is the more likely to introduce bugs. But it's hard to be objective (A brilliant developer that is assigned the most critical part of the system, will still be more likely to introduce critical bugs).
It's actually hard to monetize the development tasks. What is the cost of a bug? What is the gain of refactoring? That would be however one way to estimate the contribution of a developer.
The last cool tool I saw of this kind was the Game Plugin for Hudson continuous integration system. A score is assigned to each developer according their actions
-10 if they break the build
-1 for breaking a test
+1 for fixing a test
etc.
That's again a way to somehow assess the contribution of the developer.
All in all, I do feel like what you are asking for exist, but is still very immature.
I don't think you can get a tool to evaluate your share of the project. Measuring lines of source is all very well, but what of the quality of that source? You wouldn't want someone taking the credit for 200 lines of source if the job could have been easiy done in 20...
Also, thinking of my employer for a moment, a lot of people contribute to the project in ways other than code. Immediate examples I can think of would be Project Managers and Testers - both of whom are essential, both of whom rightly deserve some credit.
Martin
The only thing that I could imagine would be a voting system. I have absolutely no idea, if that would work in your team or anywhere - but I'm sure, that you will need humans for any realistic estimation of code quality.
In Stroustrup's Book on C++ I've read once "Don't try to solve social problems with technical means".
Thinking progmatically, the attitude and the ability of a programmer could be very quickly estimated by making a code-review together and having a talk on relevant topics.
Thinking as an IT-enthusiast and as a control-freak, this shouldn't be very hard, to implement a teachable machine-learning software, which uses version-cotrol, bug-database, etc and greates real-time performanced data for each contributor. E.g. R, KNIME or WEKA could be used for this.

Software projects and development in a research environment [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
What are useful strategies to adopt when you or the project does not have a clear idea of what the final (if any) product is going to be?
Let us take "research" to mean an exploration into an area where many things are not known or implemented and where a formal set of deliverables cannot be specified at the start of the project. This is common in STEM (science (physics, chemistry, biology, materials, etc.), technology engineering, medicine) and many areas of informatics and computer science. Software is created either as an end in itself (e.g. a new algorithm), a means of managing data (often experimental) and simulation (e.g. materials, reactions, etc.). It is usually created by small groups or individuals (I omit large science such as telescopes and hadron colliders where much emphasis is put of software engineering.)
Research software is characterised by (at least):
unknown outcome
unknown timescale
little formal project management
limited budgets (in academia at least)
unpredictability of third-party tools and libraries
changes in the outside world during the project (e.g. new discoveries which can be positive - save effort - or negative - getting scooped
Projects can be anything from days ("see if this is a worthwhile direction to go") to years ("this is my PhD topic") or longer. Frequently the people are not hired as software people but find they need to write code to get the research done or get infected by writing software. There is generally little credit for good software engineering - the "product" is a conference or journal publication.
However some of these projects turn out to be highly valuable - the most obvious area is genomics where in the early days scientists showed that dynamic programming was a revolutionary tool to help thinking about protein and nucleic structure - now this is a multi-billion industry (or more). The same is true for quantum mechanics codes to predict properties of substances.
The downside is that much code gets thrown away and it is difficult to build on. To try to overcome this we have build up libraries which are shared in the group and through the world as Open Source (but here again there is very little credit given). Many researchers reinvent the wheel ("head-down" programming where colleagues are not consulted and "hero" programming where someone tries to do the whole lot themself).
Too much formality at the start of a project often puts people off and innovation is lost (no-one will spend 2 months writing formal specs and unit tests). Too little and bad habits are developed and promulgated. Programming courses help but again it's difficult to get people doing them especially when you rely on their goodwill. Mentoring is extremely valuable but not always successful.
Are there online resources which can help to persuade people into good software habits?
EDIT: I'm grateful for dmckee (below) for pointing out a similar discussion. It's all good stuff and I particularly agree with version control as being one of the most important things that we can offer scientists (we offered this to our colleagues and got very good takeup). I also like the approach of the Software Carpentry course mentioned there.
It's extremely difficult. The environment both you and Stefano Borini describe is very accurate. I think there are three key factors which propagate the situation.
Short-term thinking
Lack of formal training and experience
Continuous turnover of grad students/postdocs to shoulder the brunt of new development
Short-term thinking. There are a few reasons that short-term thinking is the norm, most of them already well explained by Stefano. As well as the awful pressure to publish and the lack of recognition for software creation, I would emphasise the number of short-term contracts. There is simply very little advantage for more junior academics (PhD students and postdocs) to spend any time planning long-term software strategies, since contracts are 2-3 years. In the case of longer-term projects e.g. those based around the simulation code of a permanent member of staff, I have seen some applications of basic software engineering, things like simple version control, standard test cases, etc. However even in these cases, project management is extremely primitive.
Lack of formal training and experience. This is a serious handicap. In astronomy and astrophysics, programming is an essential tool, but understanding of the costs of development, particularly maintenance overheads, is extremely poor. Because scientists are normally smart people, there is a feeling that software engineering practices don't really apply to them, and that they can 'just make it work'. With more experience, most programmers realise that writing code that mostly works isn't the hard part; maintaining and extending it efficiently and safely is. Some scientific code is throwaway, and in these cases the quick and dirty approach is adequate. But all too often, the code will be used and reused for years to come, bringing consequent grief to all involved with it.
Continuous turnover of grad students/postdocs for new development. I think this is the key feature that allows the academic approach to software to continue to survive. If the code is horrendous and takes days to understand and debug, who pays that price? In general, it's not the original author (who has probably moved on). Nor is it the permanent member of staff, who is often only peripherally involved with new development. It is normally the graduate student who is implementing new algorithms, producing novel approaches, trying to extend the code in some way. Sometimes it will be a postdoc, hired specifically to work on adding some feature to an existing code, and contractually obliged to work on this area for some fraction of their time.
This model is hugely inefficient. I know a PhD student in astrophysics who spent over a year trying to implement a relatively basic piece of mathematics, only a few hundred lines of code, in an existing n-body code. Why did it take so long? Because she literally spent weeks trying to understand the existing, horribly written code, and how to add her calculation to it, and months more ineffectively debugging her problems due to the monolithic code structure, coupled with her own lack of experience. Note that there was almost no science involved in this process; just wasting time grappling with code. Who ultimately paid that price? Only her. She was the one who had to put more hours in to try and get enough results to make a PhD. Her supervisor will get another grad student after she's gone - and so the cycle continues.
The point I'm trying to make is that the problem with the software creation process in academia is endemic within the system itself, a function of the resources available and the type of work that is rewarded. The culture is deeply embedded throughout academia. I don't see any easy way of changing that culture through external resources or training. It's the system itself that needs to change, to reward people for writing substantial code, to place increased scrutiny on the correctness of results produced using scientific code, to recognise the importance of training and process in code, and to hold supervisors jointly responsible for wasting the time of the members of their research group.
I'll tell you my experience.
It is undoubt that a lot of software gets created and wasted in the academia. Fact is that it's difficult to adapt research software, purposely created for a specific research objective, to a more general environment. Also, the product of academia are scientific papers, not software. The value of software in academia is zero. The data you produce with that software is evaluated, once you write a paper on it (which takes a lot of editorial time).
In most cases, however, research groups have recognized frequent patterns, which can be polished, tested and archived as internal knowledge. This is what I do with my personal toolkit. I grow it according to my research needs, only with those features that are "cross-project". Developing a personal toolkit is almost a requirement, as your scientific needs are most likely unique for some verse (otherwise you would not be doing research) and you want to have as low amount of external dependencies as possible (since if something evolves and breaks your stuff, you will not have the time to fix it).
Everything else, however, is too specific for a given project to be crystallized. I therefore tend not to encapsulate something that is clearly a one-time solver. I do, however, go back and improve it if, later on, other projects require the same piece of code.
Short project span, and the heat of research (e.g. the publish or perish vision so central today), requires agile, quick languages, and in general, languages that can be grasped quickly. Ph.Ds in genomics and quantum chemistry don't have formal programming background. In some cases, they don't even like it. So the language must be quick, easy, clean, flexible, and easy to understand later on. The latter point is capital, as there's no time to produce documentation, and it's guaranteed that in academia, everyone will leave sooner or later, you burn the group experience to zero every three years or so. Academia is a high risk industry that periodically fires all their hard-formed executors, keeping only some managers. Having a code that is maintainable and can be easily grasped by someone else is therefore capital. Also, never underestimate the power of a google search to solve your problems. With a well deployed language you are more likely to find answers to gotchas and issues you can stumble on.
Management is a problem as well. Waterfall is out of discussion. There is no time for paperwork programming (requirements, specs, design). Spiral is quite ok, but as low paperwork as possible is clearly recommended. Fact is that anything that does not give you an article in academia is wasted time. If you spend one month writing specs, it's a month wasted, and your contract expires in 11 months. Moreover, that fatty document counts zero or close to zero for your career (as many other things: administration and teaching are two examples). Of course, Agile methods are also out of discussion. Most development is made by groups that are far, and in general have a bunch of other things to do as well. Coding concentration comes in brief bursts during "spare time" between articles, and before or after meetings. The bazaar is the most likely, but the bazaar carries a lot of issues as well.
So, to answer your question, the best strategy is "slow accumulation" of known good software, development in small bursts with a quick and agile method and language. Good coding practices need to be taught during lectures, as good laboratory practices are taught during practical courses (eg. never put water in sulphuric acid, always the opposite)
The hardest part is the transition between "this is just for a paper" and "we're really going to use this."
If you know that the code will only be for a paper, fine, take short cuts. Hardcode everything you can. Don't waste time on extensive validation if the programmer is the only one who will ever run the code. Etc. The problem is when someone says "Great! Now let's use this for real" or "Now let's use it for this entirely different scenario than what it was developed and tested for."
A related challenge is having to explain why the software isn't ready for prime time even though it obviously works, i.e. it's prototype quality and not production quality. What do you mean you need to rewrite it?
I would recommend that you/they read "Clean Code"
http://www.amazon.co.uk/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882/ref=sr_1_1?ie=UTF8&s=books&qid=1251633753&sr=8-1
The basic idea of this book is that if you do not keep the code "clean", eventually the mess will stop you from making any progress.
The kind of big science I do (particle physics) has a small number of large, long-running projects (ROOT and Geant4, for instance). These are developed mostly by actual programming professionals. Using processes that would be recognized by anyone else in the industry.
Then, each collaboration has a number of project-wide programs which are developed collaboratively under the direction of a small number of senior programming scientists. These use at least the basic tools (always version control, often some kind of bug tracking or automated builds).
Finally almost every scientist works on their own programs. Use of process on these programs is very spotty, and they often suffer from all the ills that others have discussed (short lifetimes, poor coding skills, no review, lots of serial maintainers, Not Invented Here Syndrome, etc. etc.). The only advantage that is available here compared to small group science, is that they work with the tools I talked about above, so there is something that you can point to and say "That is what you want to achieve.".
Don't really have that much more to add to what has already been said. It's a difficult balance to strike because our priorities are different - academia is all about discovering new things, software engineering is more about getting things done according to specifications.
The most important thing I can think of is to try and extricate yourself from the culture of in-house development that goes on in academia and try to maintain a disciplined approach to development, difficult as that may be in many cases owing to time restraints, lack of experience etc. This control-freakery sucks away at individual responsibility and decision-making and leaves it in the hands of a few who do not necessarily know best
Get a good book on software development, Code Complete already mention is excellent, as well as any respected book on algorithms and data structures. Read up on how you will need to manage your data eg do you need fast lookup / hash-tables / binary trees. Don't reinvent the wheel - use the libraries and things like STL otherwise you are likely to be wasting time. There is a vast amount on the web including this very fine blog.
Many academics, besides sometimes being primadonna-ish and precious about any approach seen as businesslike, tend to be quite vague in their objectives. To put it mildly. For this reason alone it is vital to build up your own software arsenal of helper functions and recipes, eventually, hopefully ending up with a kind of flexible experimental framework that enables you to try out any combination of things without being to restricted to any particular problem area. Strongly resist the temptation to just dive into the problem at hand.

Resources