Three-Way Merge Algorithms for Text

I've been working on a wiki-type site. What I'm trying to decide is the best algorithm for merging an article that is simultaneously being edited by two users.
So far I'm considering using Wikipedia's method of merging the documents if two unrelated areas are edited, but throwing away the older change if two commits conflict.
My question is as follows: If I have the original article, and two changes to it, what are the best algorithms to merge them and then deal with conflicts as they arise?

Bill Ritcher's excellent paper "A Trustworthy 3-Way Merge" talks about some of the common gotchas with three way merging and clever solutions to them that commercial SCM packages have used.
A 3-way merge automatically applies all the non-overlapping changes from each version; the trick is to automatically handle as many nearly overlapping regions as possible.
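To make the basic idea concrete, here is a minimal, naive line-based sketch in Python (my own illustration, not the algorithm from the paper). It anchors on base lines that neither side touched, takes whichever side changed each chunk in between, and emits conflict markers when both sides changed the same chunk differently.

    import difflib

    def _matched_lines(base, other):
        """Map base line index -> other line index for lines kept unchanged."""
        matches = {}
        sm = difflib.SequenceMatcher(None, base, other)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                for k in range(i2 - i1):
                    matches[i1 + k] = j1 + k
        return matches

    def merge3(base, ours, theirs):
        """Naive diff3-style merge over lists of lines.
        Returns (merged_lines, number_of_conflicts)."""
        ma, mb = _matched_lines(base, ours), _matched_lines(base, theirs)
        # Base lines left untouched by BOTH versions act as anchors.
        anchors = [i for i in range(len(base)) if i in ma and i in mb]
        merged, conflicts = [], 0
        pi = pa = pb = 0  # start of the current chunk in base / ours / theirs
        for i in anchors + [len(base)]:
            na, nb = ma.get(i, len(ours)), mb.get(i, len(theirs))
            b_chunk, o_chunk, t_chunk = base[pi:i], ours[pa:na], theirs[pb:nb]
            if o_chunk == b_chunk:                 # only theirs changed (or nobody did)
                merged.extend(t_chunk)
            elif t_chunk == b_chunk or o_chunk == t_chunk:
                merged.extend(o_chunk)             # only ours changed, or identical edits
            else:                                  # overlapping edits: real conflict
                conflicts += 1
                merged += ["<<<<<<< ours"] + o_chunk + ["======="] + t_chunk + [">>>>>>> theirs"]
            if i < len(base):
                merged.append(base[i])             # emit the anchor line itself
            pi, pa, pb = i + 1, na + 1, nb + 1
        return merged, conflicts

    merged, n = merge3(["a", "b", "c"], ["a", "B", "c"], ["a", "b", "c", "d"])
    # -> (["a", "B", "c", "d"], 0): both non-overlapping edits are applied

Real diff3 and the commercial tools in Ritcher's paper are considerably smarter about chunk alignment; this sketch is only meant to show where automatic merging ends and conflicts begin.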

There's a formal analysis of the diff3 algorithm, with pseudocode, in this paper:
http://www.cis.upenn.edu/~bcpierce/papers/diff3-short.pdf
It is titled "A Formal Investigation of Diff3" and was written by Sanjeev Khanna, Keshav Kunal, and Benjamin C. Pierce.

Frankly, I'd rely on diff3. It's on pretty much every Unix distro, and you can always build and bundle a Windows .exe to ensure it's there for your purposes.
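If you go that route, shelling out to the binary is straightforward. A hedged sketch in Python, assuming GNU diff3 is on the PATH (with -m it writes merged output with conflict markers and exits with status 1 when there are conflicts):

    import subprocess

    def merge_with_diff3(ours_path, base_path, theirs_path):
        """Run `diff3 -m ours base theirs`; a non-zero exit signals conflicts."""
        result = subprocess.run(
            ["diff3", "-m", ours_path, base_path, theirs_path],
            capture_output=True, text=True)
        return result.stdout, result.returncode != 0   # (merged text, had_conflict)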

Related

What is a collaborative algorithm?

What is a collaborative algorithm? Is there a scientifically citable reference?
Details:
I found many articles about collaborative algorithms, but none of them (or any other website) gives a definition.
I am actually looking for a term to describe distributed algorithms where each instance has all information at the beginning and can complete the whole task on its own, but the instances help each other whenever they have solved a sub-problem, so the other instances do not have to redo the work (hence "collaboration"). I picked up this terminology in A Collaborative Approach for Multi-Threaded SAT Solving. Do you think the term "collaborative algorithm" is suitable for this? If not, do you know of a better term?
No, there are no scientifically citable references.
All parallel/distributed programming is "collaborative" in the sense that several threads/nodes collaborate on the same big task.
Regarding "distributed algorithms where ... instances help each other whenever they have solved a sub-problem": even some web-application clusters fit your description: individual cluster nodes "solve subproblems" and store the "solutions" in distributed in-RAM storage (such as memcached or Cassandra or many others), thus helping each other.
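As a toy illustration of that sharing pattern (my own hypothetical sketch, not taken from the SAT paper): every worker could finish the whole job alone, but solved sub-problems are published in a shared table so the others can skip them.

    import threading

    def collaborative_solve(subproblems, solve, n_workers=4):
        """Every worker iterates over the full problem; solved sub-problems
        are shared so other workers do not redo the work."""
        shared, lock = {}, threading.Lock()

        def worker():
            for sp in subproblems:          # each instance has all information
                with lock:
                    if sp in shared:        # someone else already solved this
                        continue
                    shared[sp] = None       # claim it
                shared[sp] = solve(sp)      # solve and publish the result

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads: t.start()
        for t in threads: t.join()
        return shared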
I think the term "collaborative algorithm" is not formal. Actually, the term "algorithm" itself is not really formal, as far as I remember; I guess an algorithm can be formalized as "a program which runs on a Turing machine", and I think I've seen that definition somewhere.
So yes, all in all I guess the term you coined makes sense, but you need to define it somehow yourself (either formally or informally).
Not sure what your background is, but in scientific papers different authors sometimes use the same terms/concepts to denote different things, and sometimes use different terms to denote the same thing. Also, even though computer science papers are scientific, not all terms in them are formally defined. So I wouldn't draw too many conclusions based on these papers unless I were familiar to a decent extent with all of them, or unless some of them were considered really remarkable and widely accepted as a de facto standard in a particular sub-field or field.

Where can I find the diff algorithm?

Where can I find an explanation and implementation of the diff algorithm?
First of all, I have to admit that I'm not sure this is the correct name of the algorithm. For example, how does Stack Overflow mark the differences between two edits of the same question?
PS: I know the C and PHP programming languages.
There is really no such thing as "the diff algorithm". There are many different diff algorithms, and in fact the particular diff algorithms used are in some cases considered a business advantage of the particular diff tool.
In general, many diff algorithms are based on the Longest Common Subsequence (LCS) problem.
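As a rough sketch of that foundation, here is the classic dynamic-programming LCS computation (a minimal illustration of mine, not any particular tool's implementation):

    def lcs_length(a, b):
        """Length of the Longest Common Subsequence of sequences a and b,
        via the classic O(len(a) * len(b)) dynamic programme."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a):
            for j, y in enumerate(b):
                dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
        return dp[len(a)][len(b)]

    # Everything NOT in the LCS is what a diff reports as deleted or inserted.
    print(lcs_length("ABCBDAB", "BDCABA"))   # -> 4 (e.g. "BCAB")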
The original Unix diff program from the 1970s was written by Doug McIlroy and uses what is known as the Hunt-McIlroy algorithm. Almost 40 years later, extensions and derivatives of that algorithm are still very common.
A couple of years ago, Bram Cohen (creator of the most successful filesharing program and the least successful version control system) created the Patience Diff algorithm that is designed to give more human-readable results than LCS. It was originally implemented in the Bazaar VCS and also added to Git as an option.
However, unless you are interested in research on diff algorithms, your best bet would probably be to just use some existing diff library like Davide Libenzi's LibXDiff, which is, for example, what Git uses. I wouldn't be too surprised if there was already a PHP extension wrapping it. A nice alternative is Google's Diff-Match-Patch library, which is used in Bespin and WhiteRoom, for example, and which is available for many languages. It uses the Myers diff algorithm plus some pre- and post-processing for additional speedups.
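If the Python port of Diff-Match-Patch suits you (assuming it is installed, e.g. the diff-match-patch package from pip), basic usage looks roughly like this:

    from diff_match_patch import diff_match_patch

    dmp = diff_match_patch()
    diffs = dmp.diff_main("The quick brown fox", "The slow brown dog")
    dmp.diff_cleanupSemantic(diffs)      # group tiny edits into readable chunks
    for op, text in diffs:               # op: -1 = delete, 0 = equal, 1 = insert
        print(op, repr(text))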
A completely different approach, if you are more interested in merging than diffing, is Operational Transformation (OT). The idea of OT is that instead of figuring out the differences between two documents, you try to "reverse engineer" the operations that led to those differences. This allows for much better merging, because you can then "replay" those operations. This is most useful for real-time collaborative editors such as EtherPad, Google Wave or SubEthaEdit.
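A toy sketch of the core OT idea for concurrent inserts (hypothetical helper names of my own; real systems also handle deletes, multiple sites, and operation history):

    def transform(op_a, op_b):
        """Adjust insert op_a as if concurrent insert op_b had been applied first,
        so that both replicas end up with the same text."""
        (pos_a, text_a), (pos_b, text_b) = op_a, op_b
        if pos_a < pos_b or (pos_a == pos_b and text_a <= text_b):  # deterministic tie-break
            return (pos_a, text_a)
        return (pos_a + len(text_b), text_a)

    def apply(doc, op):
        pos, text = op
        return doc[:pos] + text + doc[pos:]

    base, op1, op2 = "abc", (1, "X"), (2, "Y")      # two users edit the same base
    site1 = apply(apply(base, op1), transform(op2, op1))
    site2 = apply(apply(base, op2), transform(op1, op2))
    assert site1 == site2 == "aXbYc"                # both replicas converge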
What is wrong with Wikipedia, where it states that it is the Hunt-McIlroy algorithm?
There is an OCR'd paper describing the algorithm (the explanation), and you can inspect the source (the implementation).
The related questions also list (among others):
'Best' Diff Algorithm
How do document diff algorithms work?
Diff Algorithm
which all seem to be useful.

How to cultivate algorithm intuition?

When faced with a problem in software I usually see a solution right away. Of course, what I see is usually somewhat off, and I always need to sit down and design (admittedly, I usually don't design enough), but I get a certain intuition right away.
My problem is I don't get that same intuition when it comes to advanced algorithms. I feel much more up to the task of building another Facebook than building another Google search, or a Music Genome Project. It's probably because I've been building software for quite some time, but I have little experience with composing algorithms.
I would like the community's advice on what to read and what projects to undertake to be better at composing algorithms.
(This question has nothing to do with Algorithmic composition. Well, almost nothing)
+1 To whoever said experience is the best teacher.
There are several online portals which have a lot of programming problems that you can submit your own solutions to and get an automated pass/fail indication.
http://www.spoj.pl/
http://uva.onlinejudge.org/
http://www.topcoder.com/tc
http://code.google.com/codejam/contests.html
http://projecteuler.net/
https://codeforces.com
https://leetcode.com
The USACO training site is the training program that all USA computing olympiad participants go through. It goes step by step, introducing more and more complex algorithms as you go.
You might find it helpful to perform algorithms physically. For example, when you're studying sorting algorithms, practice doing each one with a deck of cards. That will activate different parts of your brain than reading or programming alone will.
Steve Yegge referred to "The Algorithm Design Manual" in one of his rants. I haven't seen it myself, but it sounds like it's just the ticket from his description.
My absolute favorite for this kind of interview preparation is Steven Skiena's The Algorithm Design Manual. More than any other book it helped me understand just how astonishingly commonplace (and important) graph problems are – they should be part of every working programmer's toolkit. The book also covers basic data structures and sorting algorithms, which is a nice bonus. But the gold mine is the second half of the book, which is a sort of encyclopedia of 1-pagers on zillions of useful problems and various ways to solve them, without too much detail. Almost every 1-pager has a simple picture, making it easy to remember. This is a great way to learn how to identify hundreds of problem types.
problem domain
First you must understand the problem domain. An elegant solution to the wrong problem is no good, nor, in most cases, is an inefficient solution to the right problem. Solution quality, in other words, is often relative. A simple scheduling problem that has a deterministic solution taking ten minutes to run may be fine if schedules are recalculated once per week, but if schedules change several times a day then a genetic algorithm solution that converges in a few seconds may be required.
decomposition and mapping
Second, decompose the problem into sub-problems and known/unknown elements that correspond to elements of the solution. Sometimes this is obvious, e.g. to count widgets you need a way of identifying widgets, an incrementable counter, and a way of storing the count. Sometimes it is not so obvious. Sometimes you have to decompose the problem, the domain, and possible solutions at the same time and try several different mappings between them to find one that leads to the correct results [this is the general method].
model
Model the solution, in your head at least, and walk through it to see if it works correctly. Adjust as necessary (See decomposition and mapping, above).
composition/interfaces
Many times you can find elements of the problem and elements of the solution that map to each other and produce partial results that are useful. This composition and interface construction provides the kernel of the solution, and also serves to reduce the scope of the problem remaining. So then you just loop back to the top with a smaller initial problem and go through it again.
experience
Experience is the best teacher, of course, but reading about different kinds of problems and solutions will also be helpful. Studying some of the well-known algorithms and their applications is likewise very helpful, e.g. Dijkstra, Bresenham, Unification, and of course, graph theory.
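For example, Dijkstra's shortest-path algorithm fits in a few lines once you know the heap-based pattern. A minimal sketch over an adjacency-list graph (my own illustration):

    import heapq

    def dijkstra(graph, source):
        """Shortest distances from source in a graph given as
        {node: [(neighbour, weight), ...]} with non-negative weights."""
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue                       # stale queue entry, skip it
            for v, w in graph.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        return dist

    print(dijkstra({"a": [("b", 1), ("c", 4)], "b": [("c", 2)]}, "a"))
    # -> {'a': 0, 'b': 1, 'c': 3}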
I am not sure intuition can be cultivated, but I think I know what you are asking. The more problems you solve, the more information and experience you have at your disposal for future problems. So, I say just practice. Practice programming real world applications and you run into plenty of problems. Sometimes, solving puzzles can be very educational as well.
I try to find physical analogues when I'm looking at a complex problem.

Minimum CompSci Knowledge Needed for Writing Desktop Apps

Having been a hobbyist programmer for 3 years (mainly Python and C) and never having written an application longer than 500 lines of code, I find myself faced with two choices:
(1) Learn the essentials of data structures and algorithm design so I can become a l33t computer scientist.
(2) Learn Qt, which would help me build projects I have been itching to build for a long time.
For learning (1), everyone seems to recommend reading CLRS. Unfortunately, reading CLRS would take me at least a year of study (or more; I'm not Peter Krumins). I also understand that to accomplish any moderately complex task using (2), I will need to understand at least the fundamentals of (1), which brings me to my question: assuming I use C++ as the programming language of choice, which parts of CLRS would give me sufficient knowledge of algorithms and data structures to work on large projects using (2)?
In other words, I need a list of theoretical CompSci topics absolutely essential for everyday application programming tasks. Also, I want to use CLRS as a handy reference, so I don't want to skip any material critical to understanding the later sections of the book.
Don't get me wrong here. Discrete math and the theoretical underpinnings of CompSci have been on my "TODO: URGENT" list for about 6 months now, but I just don't have enough time owing to college work. After a long time, I have 15 days off to do whatever the hell I like, and I want to spend these 15 days building applications I really want to build rather than sitting at my desk, pen and paper in hand, trying to write down the solution to a textbook problem.
(BTW, a less-math-more-code resource on algorithms will be highly appreciated. I'm just out of high school and my math is not at the level it should be.)
Thanks :)
This could be considered heresy, but the vast majority of application code does not require much understanding of algorithms and data structures. Most languages provide libraries which contain collection classes, searching and sorting algorithms, etc. You generally don't need to understand the theory behind how these work, just use them!
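For instance, in Python (illustrated here only as an example) you rarely write a sort or a search yourself; the standard library already has them:

    import bisect

    names = ["carol", "alice", "bob"]
    names.sort()                               # library sort, no theory required
    print(names)                               # ['alice', 'bob', 'carol']
    print(bisect.bisect_left(names, "bob"))    # binary search: index 1
    print("dave" in set(names))                # hash-based membership test: False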
However, if you've never written anything longer than 500 lines, then there are a lot of things you DO need to learn, such as how to write your application's code so that it's flexible, maintainable, etc.
For a less-math, more code resource on algorithms than CLRS, check out Algorithms in a Nutshell. If you're going to be writing desktop applications, I don't consider CLRS to be required reading. If you're using C++ I think Sedgewick is a more appropriate choice.
Try some online comp sci courses. Berkeley has some, as does MIT. Software Engineering Radio is a great podcast as well.
See these questions as well:
What are some good computer science resources for a blind programmer?
https://stackoverflow.com/questions/360542/plumber-programmers-vs-computer-scientists#360554
Heed the wisdom of Don and just do it. Can you define the features that you want your application to have? Can you break those features down into smaller tasks? Can you organize the code produced by those tasks into a coherent structure?
Of course you can. Identify any 'risky' areas (areas that you do not understand, e.g. something that requires more math than you know, or special algorithms you would have to research) and either find another solution, prototype a solution, or come back to SO and ask specific questions.
Moving from 500 lines of code to a real (even if small) application is not that easy.
As Don was pointing out, you'll need to learn a lot of things about code (flexibility, reuse, etc.), and you need to learn some very basic configuration management as well (Visual SourceSafe, SVN?).
But the main issue is that you need a way to keep from being overwhelmed by your growing functionality and code. That is not easy. What I can suggest is to put in place something to 'automatically' test your code (even in a very basic way) via some regression tests. Otherwise it's going to be hard.
As you can see, I think it's not related at all to data structures, algorithms or whatever.
Good luck and let us know.
I must say that sitting down with a dry old textbook and reading it through is not the way to learn how to do anything effectively, even if you are making notes. Doing it is the best way to learn, using the textbooks as a reference. Indeed, using sites like this as a reference.
As for data structures - learn which one is good for whatever situation you envision: Sets (sorted and unsorted), Lists (ArrayList, LinkedList), Maps (HashMap, TreeMap). Learn the complexity of the basic operations - adding, removing, searching, sorting, etc. That will help you to select an appropriate library data structure to use in your application.
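As a tiny illustration of why that complexity matters (a sketch in Python for consistency with the rest of these examples; exact numbers will vary by machine):

    import timeit

    data_list = list(range(100_000))
    data_set = set(data_list)

    # Membership test: O(n) scan of a list vs. O(1) average for a hash set.
    print(timeit.timeit(lambda: 99_999 in data_list, number=1_000))
    print(timeit.timeit(lambda: 99_999 in data_set, number=1_000))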
And also make sure you're reasonably comfortable with MVC - i.e., ensure your model is separate from your view (the Qt front-end) as much as possible. Best would be to have the model and algorithms working on their own, and then put the GUI on top. Or a unit test on top. Etc...
Good luck!
It's like saying you want to move to France, so should you learn French from a book, and what are the essential words - or should you just go to France and find out which words you need to know from experience and from copying the locals?
Writing code is part of learning computer science. I was writing code long before I'd even heard of the term, and lots of people were writing code before the term was invented.
Besides, you say you're itching to write certain applications. That can't be taught, so just go ahead and do it. Some things you only learn by doing.
(The theoretical foundations will just give you a deeper understanding of what you wind up doing anyway, which will mainly be copying other people's approaches. The only caveat is that in some cases the theoretical stuff will tell you what's futile to attempt - e.g. if one of your itches is to solve an NP-complete problem, you probably won't succeed. :-)
I would say the practical aspects of coding are more important. In particular, source control is vital if you don't use that already. I like bzr as an easy to set up and use system, though GUI support isn't as mature as it could be.
I'd then move on to one or both of the classics about the craft of coding, namely
The Pragmatic Programmer
Code Complete 2
You could also check out the list of recommended books on Stack Overflow.

Have you ever used a genetic algorithm in real-world applications?

I was wondering how common it is to find genetic algorithm approaches in commercial code.
It always seemed to me that some kinds of schedulers could benefit from a GA engine, as a supplement to the main algorithm.
Genetic Algorithms have been widely used commercially. Optimizing train routing was an early application. More recently fighter planes have used GAs to optimize wing designs. I have used GAs extensively at work to generate solutions to problems that have an extremely large search space.
Many problems are unlikely to benefit from GAs. I disagree with Thomas that they are too hard to understand; a GA is actually very simple. We found that there is a huge amount of knowledge to be gained from optimizing the GA to a particular problem, which can be difficult, and, as always, managing large amounts of parallel computation continues to be a problem for many programmers.
A problem that would benefit from a GA is going to have the following characteristics (a minimal sketch of the loop itself follows the list):
A good way to encode potential solutions
A way to compute a numerical score to evaluate the quality of the solution
A large multi-dimensional search space where the answer is non-obvious
A good solution is good enough and a perfect solution is not required
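Here is a minimal sketch of the machinery, using the toy "count the 1 bits" fitness (my own illustration, not any particular commercial system):

    import random

    def genetic_algorithm(fitness, length=32, pop_size=50, generations=200,
                          mutation_rate=0.01):
        """Minimal GA sketch: evolve bit strings toward higher fitness."""
        pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            survivors = pop[:pop_size // 2]            # truncation selection
            children = []
            while len(children) < pop_size - len(survivors):
                a, b = random.sample(survivors, 2)
                cut = random.randrange(1, length)      # one-point crossover
                child = a[:cut] + b[cut:]
                child = [bit ^ (random.random() < mutation_rate) for bit in child]
                children.append(child)
            pop = survivors + children
        return max(pop, key=fitness)

    # Toy "OneMax" problem: the score is simply the number of 1 bits.
    best = genetic_algorithm(fitness=sum)
    print(sum(best), best)

Real applications differ mainly in the encoding and the fitness function; the evolutionary loop itself stays about this simple.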
There are many problems that could probably benefit from GAs, and in the future they will probably be more widely deployed. I believe that GAs are used in cutting-edge engineering more than people think; however, most companies (like mine) guard those secrets extremely closely. It is only long after the fact that it is revealed that GAs were used.
Most people that deal with "normal" applications probably don't have much use for them though.
If you want to find an example, look at Postgres's Query Planner. It uses many techniques, and one just so happens to be genetic.
http://developer.postgresql.org/pgdocs/postgres/geqo-pg-intro.html
I used a GA in my Master's thesis, but after that I haven't found anything in my daily work that a GA could solve which I couldn't solve faster with some other algorithm.
I don't think it is particularly common to find genetic algorithms in everyday-commercial code. They are more commonly found in academic/research code where the need to find the "best algorithm" is less important than the need to just find a good solution to a problem.
Nonetheless, I have consulted on a couple of commercial projects that do use GAs (chiefly as a result of my involvement with GAUL). I think the most interesting example was at a Biotech company. They used the GA to optimise scoring functions that were used for virtual screening, as part of their drug discovery application.
Earlier this year, with my current company, I added a new feature to one of our products that uses another GA. I think we might be marketing this from next month. Basically, the GA is used to explore molecules that have the potential for binding to a protein, and could therefore be further investigated as drugs targeting that protein. A competing product that also uses a GA is EA inventor.
As part of my thesis I wrote a generic Java framework for the multi-objective optimisation algorithm mPOEMS (Multiobjective prototype optimization with evolved improvement steps), which is a GA using evolutionary concepts. It is generic in the sense that all problem-independent parts have been separated from the problem-dependent parts, and an interface is provided for using the framework by adding only the problem-dependent parts. Thus anyone who wants to use the algorithm does not have to start from zero, which eases the work a lot.
You can find the code here.
The solutions which you can find with this algorithm have been compared in a scientific work with the state-of-the-art algorithms SPEA-2 and NSGA, and it has been shown that the algorithm performs comparably or even better, depending on the metrics you use to measure performance, and especially depending on the optimization problem you are looking at.
You can find it here.
Also, as part of my thesis and proof of work, I applied this framework to the project selection problem found in portfolio management. It is about selecting the projects which add the most value to the company, best support the strategy of the company, or support any other arbitrary goal. E.g. selection of a certain number of projects from a specific category, or maximization of project synergies, ...
My thesis which applies this framework to the project selection problem:
http://www.ub.tuwien.ac.at/dipl/2008/AC05038968.pdf
After that I worked in a portfolio management department at one of the Fortune 500, where they used commercial software which also applied a GA to the project selection problem / portfolio optimization.
Further resources:
The documentation of the framework:
http://thomaskremmel.com/mpoems/mpoems_in_java_documentation.pdf
mPOEMS presentation paper:
http://portal.acm.org/citation.cfm?id=1792634.1792653
Actually, with a bit of enthusiasm, anybody could easily adapt the code of the generic framework to an arbitrary multi-objective optimisation problem.
I haven't, but I've heard from a friend of mine about a company (can't remember their name) which uses mutating genetic algorithms to calculate placements and lengths of antennas (or something). And they're supposed to (according to my friend) have huge success with this. I guess GA is just too complex for the "average Joe developer" to become mainstream. Kind of like MapReduce - spectacularly cool, but WAY too advanced to hit the "mainstream"...
LibreOffice Calc uses it in its Solver module.