Automate Finding Pertinent Methods in Large Project - refactoring

I have tried to be disciplined about decomposing code into small reusable methods when possible. As the project grows, I find myself re-implementing the exact same method.
I would like to know how to deal with this in an automated way. I am not looking for an IDE-specific solution. Relying on method names alone may not be sufficient. Solutions based on Unix tools and scripting would be extremely beneficial. Answers such as "take care" are not what I am seeking.

I think the cheapest solution to implement might be to use Google Desktop. A more accurate solution would probably be much harder to implement - treat your code base as a collection of documents where the identifiers (or tokens in the identifiers) are words of the document, and then use document clustering techniques to find the closest matching code to a query. I'm aware of some research similar to that, but nothing close to out-of-the-box code that you could use. You might try looking on Google Code Search for something. I don't think they offer a desktop version, but you might be able to find some applicable code you can adapt.
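A minimal sketch of the "code base as a collection of documents" idea, assuming each method's source text is already available as a string. It swaps the document clustering mentioned above for a simpler TF-IDF ranking with cosine similarity, and the names tokenize, build_index, and search, as well as the sample methods, are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split source text into lower-cased identifier sub-tokens."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
        # Break snake_case and camelCase identifiers into their parts.
        for part in ident.split("_"):
            tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
    return [t.lower() for t in tokens]


def build_index(methods):
    """methods maps a method name to its source. Returns (vectors, idf)."""
    docs = {name: Counter(tokenize(src)) for name, src in methods.items()}
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    n = len(docs)
    # Smoothed inverse document frequency: rare tokens weigh more.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = {
        name: {tok: count * idf[tok] for tok, count in counts.items()}
        for name, counts in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, idf, top=3):
    """Rank indexed methods by similarity to a free-text query."""
    q = {tok: c * idf.get(tok, 0.0) for tok, c in Counter(tokenize(query)).items()}
    scored = sorted(((cosine(q, v), name) for name, v in vectors.items()), reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top] if score > 0]


# Hypothetical sample data; a real script would extract methods from files.
methods = {
    "repeat_string": "def repeat_string(text, count): return text * count",
    "parseHttpResponse": "def parseHttpResponse(raw): return raw.split('\\r\\n')",
    "join_path": "def join_path(base, name): return base + '/' + name",
}
vectors, idf = build_index(methods)
print(search("repeat text count times", vectors, idf))
```

Before writing a new helper, you would query the index with a few words describing what it should do and inspect the top hits.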
Edit: And here's a list of somebody's favorite code search engines. I don't know whether any are adaptable for local use.
Edit2: Source Code Search Engine is mentioned in the comments following the list of code search engines. It appears to be a commercial product (with a free evaluation version) that is intended to be used for searching local code.

Related

How can I efficiently find all people mentioned in some text, while tolerating spelling mistakes?

I have a list of names of millions of famous people (from Wikidata), and I need to create a system that efficiently finds all people mentioned in a fairly short text: anything from a single word (e.g. "Einstein") to a few pages of text (e.g. a Wikipedia page).
I need the system to be fairly tolerant of spelling mistakes (e.g. Mikael Jackson instead of Michael Jackson) and short forms (e.g. M. Jackson). In case of ambiguity, it should return all possible people (e.g. "George Bush" should return both father and son, and possibly other namesakes).
This related question has a few interesting answers, including one that suggests the Aho-Corasick algorithm. There are libraries for it in many languages, including Python. However, the algorithm does not seem to support fuzzy search (i.e. tolerating misspellings).
I guess I could extend the vocabulary to include all the possible spellings of each name, but that would make the vocabulary too large, so I would rather avoid that if possible (moreover, I may want to extend this solution to more than just people at one point).
I took a quick look at Lucene/ElasticSearch but it does not seem to support this use case (unless I missed it).
Any ideas?
Elasticsearch supports fuzzy matching based on edit distance, for example through the fuzziness parameter of its match query. See the documentation here.
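To illustrate what edit-distance matching means, here is a small stand-alone sketch. It is not Elasticsearch's implementation and it will not scale to millions of names, since it compares every name against every window of words. The function names and the max_edits threshold are assumptions chosen for the example, and short forms such as "M. Jackson" are not handled.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def find_people(text, names, max_edits=2):
    """Return every known name that some run of words in text nearly matches."""
    words = re.findall(r"\w+", text.lower())
    found = set()
    for name in names:
        target = name.lower()
        width = len(target.split())
        # Slide a window with as many words as the name has.
        for i in range(len(words) - width + 1):
            window = " ".join(words[i:i + width])
            if edit_distance(window, target) <= max_edits:
                found.add(name)
                break
    return sorted(found)


# Hypothetical sample list standing in for the Wikidata names.
names = ["Michael Jackson", "Janet Jackson", "Albert Einstein"]
print(find_people("I saw Mikael Jackson on stage", names))
```

A production system would replace the brute-force loop with an index built for approximate lookup, which is the part a search engine provides.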

How to do algorithm visualization?

I am looking for an algorithm visualization library/tool that is well documented and you can call from your source code.
I took a look at jhave (example of usage) and I liked it. It seems to have some documentation, but I do not trust its future.
I found this article about Algorithm Explorer, which has a nice idea. It is implemented as a C++ API, but I cannot find it anywhere.
My main idea is that I want to do some unit tests for the brain.
So I construct various exercises and in future when I want to test my knowledge I redo them.
I find that images stick with me longer, which is why I want to visualize algorithms in certain states. (I might better remember a tricky case, such as what happens when I run quicksort on data sorted in reverse, if I can see it.)
An ideal tool:
1. Has to integrate with any language.
2. Has to be well documented, with a growing community and examples.
3. Has to be implemented on top of a capable rendering engine (OGRE, XNA).
Here is the place you need to visit: The Algorithm Visualization Portal!

How do you decide whether to use a library or write your own implementation

Inspired by this question which started out innocently but is turning into a major flame war.
Let's say you need a utility method - reasonably straightforward but not a one-liner. The quoted question was how to repeat a string X times. How do you decide whether to use a 3rd party implementation or write your own?
The obvious downside to 3rd party approach is you're adding a dependency to your code.
But if you're writing your own you need to code it, test it, (maybe) profile it so you'll likely end up spending more time.
I know the decision itself is subjective, but criteria you use to arrive at it should not be.
So, what criteria do you use to decide when to write your own code?
General Decision
Before deciding on what to use, I will create a list of criteria that must be met by the library. This could include size, simplicity, integration points, speed, problem complexity, dependencies, external constraints, and license. Depending on the situation the factors involved in making the decision will differ.
Generally, I will hunt for a suitable library that solves the problem before writing my own implementation. If I have to write my own, I will read up on appropriate algorithms and seek ideas from other implementations (e.g., in a different language).
If, after all the aspects described below, I can find no suitable library or source code, and I have searched (and asked on suitable forums), then I will develop my own implementation.
Complexity
If the task is relatively simple (e.g., a MultiValueMap class), then:
Find an existing open-source implementation.
Integrate the code.
Rewrite it, or trim it down, if it is excessive.
If the task is complex (e.g., a flexible object-oriented graphing library), then:
Find an open-source implementation that compiles (out-of-the-box).
Execute its "Hello, world!" equivalent.
Perform any other evaluations as required.
Determine its suitability based on the problem domain criteria.
Speed
If the library is too slow, then:
Profile it.
Optimize it.
Contribute the results back to the community.
If the code is too complex to be optimized, and speed is a factor, discuss it with the community and provide profiling details. Otherwise, look for an equivalent, but faster (possibly less feature-rich) library.
API
If the API is not simple, then:
Write a facade and contribute it back to the community.
Or find a simpler API.
Size
If the compiled library is too large, then:
Compile only the necessary source files.
Or find a smaller library.
Bugs
If the library does not compile out of the box, seek alternatives.
Dependencies
If the library depends on scores of external libraries, seek alternatives.
Documentation
If there is insufficient documentation (e.g., user manuals, installation guides, examples, source code comments), seek alternatives.
Time Constraints
If there is ample time to find an optimal solution, then do so. Often there is not sufficient time to write from scratch. And usually there are a number of similar libraries to evaluate. Keep in mind that, by meticulous loose coupling, you can always swap one library for another. Find what works, initially, and if it later becomes a burden, replace it.
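The loose coupling mentioned above can be sketched as follows. Application code depends only on a small interface of its own, and each library sits behind an adapter, so swapping libraries means writing one new adapter. The names Serializer, JsonSerializer, LiteralSerializer, and round_trip are hypothetical, and the two standard-library backends merely stand in for competing third-party libraries.

```python
import ast
import json
from typing import Any, Protocol


class Serializer(Protocol):
    """The only interface application code is allowed to depend on."""

    def dumps(self, value: Any) -> str: ...

    def loads(self, text: str) -> Any: ...


class JsonSerializer:
    """Adapter over the standard-library json module."""

    def dumps(self, value: Any) -> str:
        return json.dumps(value, sort_keys=True)

    def loads(self, text: str) -> Any:
        return json.loads(text)


class LiteralSerializer:
    """Drop-in replacement built on repr() and ast.literal_eval()."""

    def dumps(self, value: Any) -> str:
        return repr(value)

    def loads(self, text: str) -> Any:
        return ast.literal_eval(text)


def round_trip(settings: dict, serializer: Serializer) -> dict:
    """Application code: it never touches json or ast directly."""
    return serializer.loads(serializer.dumps(settings))


settings = {"theme": "dark", "font_size": 12}
for backend in (JsonSerializer(), LiteralSerializer()):
    # Either backend can be passed in without changing round_trip.
    print(type(backend).__name__, round_trip(settings, backend))
```

If the first library later becomes a burden, only the adapter class changes; callers of round_trip stay as they are.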
Development Environment
If the library is tied to a specific development environment, seek alternatives.
License
Open source.
10 questions ...
+++ (use library) ... --- (write own library)
Is the library exactly what I need? Customizable in a few steps? +++
Does it provide almost all functionality? Easily extensible? +++
No time? +++
It's good for one half and plays well with others? ++
Hard to extend, but excellent documentation? ++
Hard to extend, yet most of the functionality? +
Functionality ok, but outdated? -
Functionality ok, .. but weird (crazy interface, not robust, ...)? --
Library works, but the person who needs to decide is in a state of hubris? ---
Library works, manageable code size, portfolio needs update? ---
Some thoughts ...
If it is something small but useful, probably for others too, then why not write a library and put it on the web? The cost of publishing this kind of small library has decreased, as has the hurdle for others to join in (see Bitbucket or GitHub). So what are the criteria?
Maybe it should not exactly replicate an existing, well-known library. If it replicates something existing, it should approach the problem from a new angle or, better, provide a shorter or more condensed* solution.
*/fun
If it's a trivial function, it's not worth pulling in an entire library.
If it's a non-trivial function, then it may be worth it.
If it's multiple functions which can all be handled by pulling in a single library, it's almost definitely worth it.
Keep it in balance
You should keep several criteria in balance. I'd consider a few topics and ask a few questions.
Developing time VS maintenance time
Can I develop what I need in a few hours? If yes, why do I need a library? If I adopt a library, am I sure that it will not cost me hours of debugging and reading documentation? The answer: if I need something obvious and straightforward, I don't need an extra-flexible lib.
Simplicity VS flexibility
If I need just an error wrapper, do I need a lib with flexible types, stack tracing, color printing, and so on? Nope! Even beautifully designed but flexible, multipurpose libs can slow your code down. If you plan to use 2% of the functionality, you don't need it.
Big task VS small task
Am I facing a huge task that needs external code to solve? AMQP or SQL operations are definitely too big to develop from scratch, but tiny logging needs can be solved in place. Don't use external libs to solve small tasks.
My own libs VS external libs
Sometimes it is better to grow your own library, because it is 100% used, 100% suited to your goals, you know it best, and it is always up to date with your applications. Don't build your own lib just to be cool, and keep in mind that a lot of the libs in your vendor directory were developed "just to be cool".
For me this would be a fairly easy answer.
If you need to be cost effective, then it would probably be best to try and find a library/framework that does what you want. If you can't find it, then you will be forced to write it or find a different approach.
If you have the time and find it fun, write one. You will learn a lot along the way and you can give back to the open source community with your killer new bundle of code. If you don't, well, then don't. But if you can't find one, then you have to write it anyway ;)
Personally, if I can justify writing a library, I always opt for that. It's fun, you learn a lot about what you are directing your focus towards, and you have another tool to add to your arsenal and put on your CV.
If the functionality is only a small part of the app, or if your needs are the same as everyone else's, then a library is probably the way to go. If you need to consume and output JSON, for example, you can probably knock something together in five minutes to handle your immediate needs. But then you start adding to it, bit by bit. Eventually, you have all the functionality that you would find in any library, but 1) you had to write it yourself and 2) it isn't as robust and well documented as what you would find in a library.
If the functionality is a big part of the app, and if your needs aren't exactly the same as everyone else's, then think much more carefully. For example, if you are doing machine learning, you might consider using a package like Weka or Mahout, but these are two very different beasts, and this component is likely to be a significant part of your application. A library in this case could be a hindrance, because your needs might not fit the design parameters of the original authors, and if you attempt to modify it, you will need to worry about a much larger and more complex system than the minimum that you would build yourself.
There's a good article out there talking about sanitizing HTML, and how it was a big part of the app, and something that would need to be heavily tuned, so using an outside library wasn't the best solution, in spite of the fact that there were many libraries out that did exactly what seemed to be called for.
Another consideration is security.
If a black-hat hacker finds a bug in your code, they can create an exploit and sell it for money. The more popular the library is, the more the exploit is worth. Think about OpenSSL or WordPress exploits. If you re-implement the code, chances are that your code is not vulnerable in exactly the same way the popular library is. And if your lib is not popular, then a zero-day exploit of your code probably wouldn't be worth much, and there is a good chance your code is not targeted by bounty hunters.
Another consideration is language safety. C can be very fast, but from a security standpoint it's asking for trouble. If you reimplement the lib in a scripting language, the chances of arbitrary code execution exploits are low (as long as you know the possible attack vectors, such as serialization or evals).

Finding patterns in source code

If I wanted to learn about pattern recognition in general what would be a good place to start (recommend a book)?
Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)
Thanks
Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.
If you are reasonably mathematically confident then either of Chris Bishop's books "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition" are very good for learning about pattern recognition.
It helps if you have access to the parse tree generated during compilation. That way you can look for pieces of the tree that are similar, ignoring the nodes that are deeper than the level you are looking at. For example, you can pick out nodes that multiply two sub-expressions together while ignoring the contents of those sub-expressions. You can apply the same logic to a collection of nodes: say you want to find a multiplication of two sub-expressions where both sub-expressions are additions of further sub-expressions. You first look for multiplications, then check whether the two nodes underneath each one are additions, ignoring anything deeper.
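As a rough illustration of that shallow tree matching, the sketch below uses Python's standard ast module, assuming the code under inspection is Python. The helper names is_add and find_products_of_sums, and the sample source, are invented for the example.

```python
import ast


def is_add(node):
    """True if node is a binary addition, whatever its operands contain."""
    return isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)


def find_products_of_sums(source):
    """Return line numbers of multiplications whose two operands are additions."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Step 1: look for multiplications.
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
            # Step 2: check only the two direct children, ignoring anything deeper.
            if is_add(node.left) and is_add(node.right):
                hits.append(node.lineno)
    return sorted(hits)


SAMPLE = """\
a = (x + 1) * (y + 2)
b = x * (y + 2)
c = (f(x) + g(y) ** 2) * (1 + z)
"""
print(find_products_of_sums(SAMPLE))
```

Line 3 of the sample matches even though its operands are more complicated, because only the shape of the top two levels is compared.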
I'd suggest looking at the code of some open source project (e.g. FindBugs or SIM) that does the kind of thing you're talking about.
If you're working in one of the supported languages, IntelliJ IDEA has a really smart structural search and replace that would fit your problem.
Other interesting projects are PMD and Eclipse.
Eclipse uses AST (abstract syntax trees) for all source code in any project. Tools can then register for certain types of ASTs (like Java source) and get a preprocessed view where they can add additional information (like links to documentation, error markers, etc).
Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.
Clone Detective is specific to .NET and Visual Studio, but it finds duplicate code in your project. I've found that it reports some false positives, but it could be a good place to start.
One kind of pattern is code that has been cloned by copy and paste methods. See CloneDR for a tool that automatically finds such code in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.
CloneDR works with a variety of languages: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone detection reports for a variety of programming languages.
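The general idea of comparing syntax trees instead of text can be sketched in a few lines. This is not how CloneDR works internally, only a much simpler approximation that catches exact structural copies with renamed parameters and local variables. It assumes Python source, and fingerprint and find_clones are hypothetical helper names.

```python
import ast
import copy
from collections import defaultdict


def fingerprint(func):
    """Dump a function's AST with its name, parameters, and locals normalized."""
    func = copy.deepcopy(func)
    placeholders = {}
    # First pass: give each parameter and assigned variable a positional alias.
    for node in ast.walk(func):
        if isinstance(node, ast.arg):
            placeholders.setdefault(node.arg, f"v{len(placeholders)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            placeholders.setdefault(node.id, f"v{len(placeholders)}")
    # Second pass: rewrite those names; globals and builtins are left alone.
    for node in ast.walk(func):
        if isinstance(node, ast.arg):
            node.arg = placeholders[node.arg]
        elif isinstance(node, ast.Name) and node.id in placeholders:
            node.id = placeholders[node.id]
    func.name = "_"
    # ast.dump omits line and column numbers by default, so layout is ignored.
    return ast.dump(func)


def find_clones(source):
    """Group function names whose normalized ASTs are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[fingerprint(node)].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


SAMPLE = """\
def total(items):
    acc = 0
    for item in items:
        acc += item.price
    return acc

def sum_prices(products):
    result = 0
    for p in products:
        result += p.price
    return result

def longest(words):
    return max(words, key=len)
"""
print(find_clones(SAMPLE))
```

Unlike the tools named above, this sketch does not tolerate changes in the body of a clone; any extra or reordered statement produces a different fingerprint.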

Do you know any patterns for GUI programming? (Not patterns on designing GUIs)

I'm looking for patterns that concern coding parts of a GUI. Not as global as MVC, that I'm quite familiar with, but patterns and good ideas and best practices concerning single controls and inputs.
Let's say I want to make a control that displays some objects that may overlap. Now if I click on an object, I need to find out what to do (just finding the object I can do in several ways, such as a quadtree and Z-order; that's not the problem). I might also hold down a modifier key, or some object might be active from the beginning, making the selection or whatever a bit more complicated. Should the object instance representing a screen object handle the user action when clicked, or should a master class do it? What kind of patterns or solutions are there for problems like this?
To be honest, I think you are better off just boning up on the standard design patterns and applying them to the individual problems that you face in developing your UI.
While there are common UI "themes" (such as dealing with modifier keys) the actual implementation may vary widely.
I have O'Reilly's Head First Design Patterns and The Poster, which I have found invaluable!
Shameless Plug : These links are using my associates ID.
Object-Oriented Design and Patterns by Cay Horstmann has a chapter entitled "Patterns and GUI Programming". In that chapter, Horstmann touches on the following patterns:
Observer
Layout Managers and the Strategy Pattern
Components, Containers, and the Composite Pattern
Scroll Bars and the Decorator Pattern
I don't think that the benefit of design patterns comes from trying to find a design pattern to fit a problem. You can, however, use some heuristics to help clean up your design quite a bit, like keeping the UI as decoupled as possible from the rest of the objects in your system.
There is a pattern that might help out in this case, the Observer Pattern.
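A minimal sketch of the Observer Pattern applied to the selection problem from the question, assuming a hypothetical Selection class that acts as the subject. The control that does the hit-testing only calls select(); whatever reacts to a click (a property panel, a status bar) subscribes separately, so the screen objects stay decoupled from the rest of the UI.

```python
class Selection:
    """Subject: holds the selected object and notifies listeners on change."""

    def __init__(self):
        self._listeners = []
        self._selected = None

    def subscribe(self, listener):
        """Register a callable that receives the newly selected object."""
        self._listeners.append(listener)

    def select(self, obj):
        """Change the selection and notify every listener, once per change."""
        if obj == self._selected:
            return  # nothing changed, so nobody is notified
        self._selected = obj
        for listener in list(self._listeners):
            listener(obj)


# Hypothetical observers: each one records what it was told.
log = []
selection = Selection()
selection.subscribe(lambda obj: log.append(("panel", obj)))
selection.subscribe(lambda obj: log.append(("status", obj)))

selection.select("circle")   # e.g. the result of a click hit-test
selection.select("circle")   # clicking the same object again is ignored
selection.select(None)       # clicking empty space clears the selection
print(log)
```

Modifier keys could be handled in the control before it calls select(), which keeps that logic out of both the screen objects and the observers.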
I know you said not as global as MVC, but there are some variations on MVC - specifically HMVC and PAC - which I think can answer questions such as the ones you pose.
Other than that, try to write new code "in the spirit" of existing patterns even if you don't apply them directly.
Perhaps you're looking for something like the 'MouseTrap', which I saw in some articles on CodeProject (search for UI Platform)?
I also found this series very useful http://codebetter.com/jeremymiller/2007/07/26/the-build-your-own-cab-series-table-of-contents/ where you might have a look at embedded controllers etc.
Micha.
You are asking about professional application programming. I searched for tips and tricks for a long time, without success. Unfortunately you will not find anything useful: it is a complicated topic, and only with many years of experience will you understand how to write an application efficiently. For example, almost every program opens a file, extracts information, shows it in different forms, allows processing, saving, ... but nobody explains exactly what a good strategy is, and so on. Further, if you are writing a big application, you need to look at some strategies to reduce your compilation time (otherwise you will wait hours at every compilation). The pimpl idiom in C++ helps here, for example. And then there is a lot more. For this reason software developers are well paid and there are so many jobs :-)
