Comparing two text files: what was changed and where? - algorithm

Imagine you have two text files (say 500 kB to 3 MB large): the first is the original, the second is an updated version of it. How can I find out what was changed (inserted, deleted) and where the changes took place (in the updated file, compared to the original)?
Is there any tool or library somewhere?
Does this function reside in any well-known text editors?
Does anybody know an algorithm? Or what are the common methods to solve it on the large scale?
What would you do if you face this kind of problem?
Thanks for your ideas...

What you're describing sounds exactly like a diff-style tool. This sort of functionality is available in many of the more advanced text editors.

You can try Notepad++; it is an open-source text editor that has a file-comparison plug-in.

There is an extensive list of file comparison tools on Wikipedia.
If you want to do it programmatically, I've used sed and awk on Unix systems before now, and there are Windows versions. Basically, these file-processing languages let you read and compare text files line by line and then do something with the differences (for example, save them to a third file).
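If you want to do the same thing programmatically in a general-purpose language, Python's standard difflib module gives you diff-style line-by-line output directly (the file contents below are just made-up illustrations):

```python
import difflib

old = ["the quick brown fox\n", "jumps over\n", "the lazy dog\n"]
new = ["the quick brown fox\n", "leaps over\n", "the lazy dog\n"]

# unified_diff yields classic diff output: '-' marks lines removed from
# the original, '+' marks lines inserted in the update, and the @@
# hunk headers tell you where in each file the change took place.
for line in difflib.unified_diff(old, new, fromfile="original", tofile="update"):
    print(line, end="")
```

In a real program you would fill `old` and `new` with `open(path).readlines()`; difflib also offers `HtmlDiff` for a side-by-side visual report.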

The unix diff tool does line-by-line differences; there is a GNU tool called wdiff which will do word-by-word differences, and should be available as a package for most Linux distributions or Cygwin.
Classic papers on the algorithm are:
"An Algorithm for Differential File Comparison" - James W. Hunt and M. Douglas McIlroy (1976)
"An O(ND) Difference Algorithm and Its Variations" - Eugene W. Myers (1986)

GNU Diffutils http://www.gnu.org/software/diffutils/

Is there any tool or library somewhere?
There are many. Try diff, a command-line file comparison utility that works fine for small diffs. But if the two files differ a lot, the output of diff can be hard to understand. In that case you can use a visual file diff tool such as DiffMerge, Kompare or vimdiff.
Does this function reside in any well-known text editors?
Many modern editors, like Vim and Eclipse, have this visual diffing feature.
Does anybody know an algorithm? Or what are the common methods to solve it on the large scale?
It is based on the longest common subsequence algorithm, popularly known as LCS.
The LCS of the old text and the new text gives the parts that have remained unchanged, so the parts of the old text that are not in the LCS are the ones that were changed.
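To make that concrete, here is a minimal sketch in Python (the function name and the edit-script format are my own, not any library's): build the LCS table with dynamic programming, then backtrack; every line not on the LCS path is either a deletion or an insertion.

```python
def lcs_diff(old, new):
    """Derive an edit script between two lists of lines via the classic
    LCS dynamic program: O(len(old) * len(new)) time and space, so fine
    for illustration but too slow for very large files."""
    m, n = len(old), len(new)
    # dp[i][j] = length of the LCS of old[:i] and new[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if old[i] == new[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # Backtrack: lines on the LCS path are unchanged; the rest were
    # deleted from old ('-') or inserted in new ('+').
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old[i - 1] == new[j - 1]:
            i, j = i - 1, j - 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            ops.append(("+", new[j - 1]))
            j -= 1
        else:
            ops.append(("-", old[i - 1]))
            i -= 1
    return ops[::-1]

# lcs_diff(["a", "b", "c"], ["a", "c", "d"]) -> [("-", "b"), ("+", "d")]
```

Real tools replace the quadratic table with Myers' O(ND) algorithm or Hunt-McIlroy's approach, but the LCS idea is the same.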
What would you do if you face this kind of problem?
I'd use one of the visual diff tools mentioned to see what and where the changes were made.

Related

OpenSCAD Troubleshooting Method: How (with what strategies or tools) can I find and plug a “leak” in a 3D model or in an included STL file?

I receive occasional “ERROR: The given mesh is not closed! Unable to convert to CGAL_Nef_Polyhedron.” messages from OpenSCAD. I have a hard time finding the origin of the problem, but I suspect it might come from STL files I included in my model⁽¹⁾.
So, apart from the recommended best practice when writing the code, avoiding shared surfaces, what strategies or tools can I use to find WHERE those leaks are (and how can I “plug” them)?
(1) I made those STL files myself with OpenSCAD, from other STL files I had transformed with Tinkercad, and the process included taking a cut to extract writing (both sides: the writing and its negative), combining them with cones (minkowski), etc., and my code itself is quite complex. So there are many possible sources for this problem, and I'm looking for ways to isolate them.
Edit : Someone on a group suggested the Meshlab software to analyse the STL files.
Meshlab seems like a good idea to me. If you go into the 'Filters' menu, there is a 'Cleaning and Repairing' section where you may find your solution.

CastaliaResults to plot graphs

I am learning to use Castalia for WSN simulation.
I noticed that in the user manual, the YYMMDD-HHmmss serial number referred to in the examples of obtaining the Castalia results was "100809-004640.txt".
However, in the CastaliaPlot examples for plotting the graph, a new number, "101209-235427.txt", is referred to.
Can anybody help with the source of this new number in Castalia 3.2 manual?
As you realised, the output files are named with a YYMMDD-HHmmss.txt convention.
The name of the actual file does not matter much. They are used as examples in the manual. Obviously you will not have the same filenames when trying out the commands on your own.
In the specific example you mention, it would be better (more consistent) if the manual used the same filename, but it does not matter much.
I wrote the manual. I do not remember exactly what I did all these years back, but the simple explanation for the new filename is that I was rewriting only that part of the manual (the graphs part), so I used some more recent output files to draw the graphs. Probably you can find instances of this minor inconsistency in other places in the manual as well.

Where can I find the diff algorithm?

Where can I find an explanation and implementation of the diff algorithm?
First of all, I have to admit that I'm not sure whether this is the correct name of the algorithm. For example, how does Stack Overflow mark the differences between two edits of the same question?
PS: I know the C and PHP programming languages.
There is really no such thing as "the diff algorithm". There are many different diff algorithms, and in fact the particular diff algorithms used are in some cases considered a business advantage of the particular diff tool.
In general, many diff algorithms are based on the Longest Common Subsequence (LCS) problem.
The original Unix diff program from the 1970s was written by Doug McIlroy and uses what is known as the Hunt-McIlroy algorithm. Almost 40 years later, extensions and derivatives of that algorithm are still very common.
A couple of years ago, Bram Cohen (creator of the most successful filesharing program and the least successful version control system) created the Patience Diff algorithm that is designed to give more human-readable results than LCS. It was originally implemented in the Bazaar VCS and also added to Git as an option.
However, unless you are interested in research on diff algorithms, your best bet would probably be to just use an existing diff library such as Davide Libenzi's LibXDiff, which is, for example, what Git uses. I wouldn't be too surprised if there were already a PHP extension wrapping it. A nice alternative is Google's Diff-Match-Patch library, which is used in Bespin and WhiteRoom, for example, and which is available for many languages. It uses the Myers diff algorithm plus some pre- and post-processing for additional speedups.
A completely different approach, if you are more interested in merging than diffing, is called Operational Transformation. The idea of OT is that instead of figuring out the differences between two documents, you try to "reverse engineer" the operations that led to those differences. This allows for much better merging, because you can then "replay" those operations. OT is most useful for real-time collaborative editors such as EtherPad, Google Wave or SubEthaEdit.
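As a toy illustration of the OT idea (handling only insertions into a plain string; real OT systems also transform deletions and resolve ties at the same position with a deterministic tie-break, which is omitted here):

```python
def apply_insert(doc, op):
    """Apply an insert operation (position, text) to a string."""
    pos, text = op
    return doc[:pos] + text + doc[pos:]

def transform(op_a, op_b):
    """Transform op_a so that applying it AFTER op_b still has its
    originally intended effect: if op_b inserted text at or before
    op_a's position, op_a's position must shift right by that length."""
    (pos_a, text_a), (pos_b, text_b) = op_a, op_b
    if pos_a < pos_b:
        return (pos_a, text_a)
    return (pos_a + len(text_b), text_a)

# Two users edit "abc" concurrently; either application order converges:
a, b = (1, "X"), (0, "YY")
doc1 = apply_insert(apply_insert("abc", b), transform(a, b))
doc2 = apply_insert(apply_insert("abc", a), transform(b, a))
# doc1 == doc2 == "YYaXbc"
```

The convergence of both orders is exactly what lets a collaborative editor "replay" remote operations on top of local ones instead of diffing documents.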
What is wrong with Wikipedia, where it states that it is the Hunt-McIlroy algorithm?
There is an OCR'd paper describing the algorithm (the explanation), and you can inspect the source (the implementation).
The SO related questions also list (among others):
'Best' Diff Algorithm
How do document diff algorithms work?
Diff Algorithm
which all seem to be useful.

Three Way Merge Algorithms for Text

So I've been working on a wiki-type site. What I'm trying to decide is the best algorithm for merging an article that is simultaneously being edited by two users.
So far I'm considering using Wikipedia's method: merge the documents if two unrelated areas are edited, but throw away the older change if two commits conflict.
My question is as follows: If I have the original article, and two changes to it, what are the best algorithms to merge them and then deal with conflicts as they arise?
Bill Ritcher's excellent paper "A Trustworthy 3-Way Merge" talks about some of the common gotchas with three way merging and clever solutions to them that commercial SCM packages have used.
The 3-way merge will automatically apply all the changes (which are not overlapping) from each version. The trick is to automatically handle as many almost overlapping regions as possible.
There's a formal analysis of the diff3 algorithm, with pseudocode, in this paper:
http://www.cis.upenn.edu/~bcpierce/papers/diff3-short.pdf
It is titled "A Formal Investigation of Diff3" and was written by Sanjeev Khanna, Keshav Kunal, and Benjamin C. Pierce.
Frankly, I'd rely on diff3. It's on pretty much every Unix-like system, and you can always build and bundle an .EXE for Windows to make sure it is there for your purposes.

Finding patterns in source code

If I wanted to learn about pattern recognition in general what would be a good place to start (recommend a book)?
Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)
Thanks
Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.
If you are reasonably mathematically confident then either of Chris Bishop's books "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition" are very good for learning about pattern recognition.
It helps if you have access to the parse tree generated during compilation. That way you can look for pieces of the tree which are similar, ignoring nodes that lie deeper than the level you are looking at: for example, you can pick out nodes that multiply two sub-expressions together while ignoring the contents of those sub-expressions. You can apply the same logic to a collection of nodes, e.g. finding a multiplication of two sub-expressions where both sub-expressions are additions of further sub-expressions: first look for multiplies, then check whether the two nodes underneath the multiply are additions, ignoring anything deeper.
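Using Python's own ast module, a toy version of this subtree-matching idea (not a production clone detector; function name and thresholds are my own) is to hash every subtree by its structural dump and report the ones that occur more than once:

```python
import ast
from collections import defaultdict

def repeated_subtrees(source, min_nodes=4):
    """Group subtrees of the parse tree by their structure (ast.dump);
    dumps seen at more than one location are candidate clones. Tiny
    subtrees are skipped to cut noise. Identifiers are part of the dump
    here, so this finds exact clones only; renaming-tolerant detection
    would normalize names before hashing."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if sum(1 for _ in ast.walk(node)) >= min_nodes:
            buckets[ast.dump(node)].append(getattr(node, "lineno", None))
    return {dump: lines for dump, lines in buckets.items() if len(lines) > 1}
```

For instance, `repeated_subtrees("x = a * (b + c)\ny = a * (b + c)\n")` reports the duplicated `a * (b + c)` expression on lines 1 and 2.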
I'd suggest looking at the code of an open-source project (e.g. FindBugs or SIM) that does the kind of thing you're talking about.
If you're working in one of the supported languages, IntelliJ IDEA has a really smart structural search and replace that would fit your problem.
Other interesting projects are PMD and Eclipse.
Eclipse uses AST (abstract syntax trees) for all source code in any project. Tools can then register for certain types of ASTs (like Java source) and get a preprocessed view where they can add additional information (like links to documentation, error markers, etc).
Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.
This is specific to .NET and Visual Studio, but it finds duplicate code in your project. It does report some false positives, I've found, but it could be a good place to start.
Clone Detective
One kind of pattern is code that has been cloned by copy-and-paste. See CloneDR for a tool that automatically finds such code, in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.
CloneDR works with a variety of languages: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone-detection reports for a variety of programming languages.
