Diff Algorithm? [closed]


As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I've been looking like crazy for an explanation of a diff algorithm that works and is efficient.
The closest I got is this link to RFC 3284 (from several Eric Sink blog posts), which describes in perfectly understandable terms the data format in which the diff results are stored. However, it has no mention whatsoever as to how a program would reach these results while doing a diff.
I'm trying to research this out of personal curiosity, because I'm sure there must be tradeoffs when implementing a diff algorithm, which are pretty clear sometimes when you look at diffs and wonder "why did the diff program choose this as a change instead of that?"...
Where can I find a description of an efficient algorithm that'd end up outputting VCDIFF?
By the way, if you happen to find a description of the actual algorithm used by SourceGear's DiffMerge, that'd be even better.
NOTE: longest common subsequence doesn't seem to be the algorithm used by VCDIFF, it looks like they're doing something smarter, given the data format they use.

An O(ND) Difference Algorithm and its Variations (1986, Eugene W. Myers) is a fantastic paper and you may want to start there. It includes pseudo-code and a nice visualization of the graph traversals involved in doing the diff.
Section 4 of the paper introduces some refinements to the algorithm that make it very effective.
Successfully implementing this will leave you with a very useful tool in your toolbox (and probably some excellent experience as well).
Generating the output format you need can sometimes be tricky, but if you understand the algorithm's internals, you should be able to output anything you need. You can also introduce heuristics to affect the output and make certain tradeoffs.
Here is a page that includes a bit of documentation, full source code, and examples of a diff algorithm using the techniques in the aforementioned algorithm.
The source code appears to follow the basic algorithm closely and is easy to read.
There's also a bit on preparing the input, which you may find useful. There's a huge difference in output when you are diffing by character or token (word).
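To give a feel for the graph traversal the Myers paper describes, here is a minimal Python sketch of the basic greedy search (without the refinements from the paper's later sections). It only returns the length of the shortest edit script, counting insertions and deletions; the function name `shortest_edit_distance` and the variable names are my own, and recovering the actual script would need extra bookkeeping that is left out.

```python
def shortest_edit_distance(a, b):
    """Return the minimum number of insertions plus deletions turning a into b."""
    n, m = len(a), len(b)
    # furthest[k] is the largest x reached so far on diagonal k, where k = x - y.
    # The entry for k = 1 is a sentinel that seeds the first round (d = 0).
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # move down: insert an element of b
            else:
                x = furthest[k - 1] + 1    # move right: delete an element of a
            y = x - k
            # Follow the diagonal ("snake") while the elements match, at no cost.
            while x < n and y < m and a[x] == b[y]:
                x, y = x + 1, y + 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    raise AssertionError("unreachable: d = n + m always suffices")


if __name__ == "__main__":
    print(shortest_edit_distance("ABCABBA", "CBABAC"))
```

Because `a` and `b` are only indexed and compared, the same function works on lists of lines or words, which is where the choice between diffing by character and diffing by token shows up.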

I would begin by looking at the actual source code for diff, which GNU makes available.
For an understanding of how that source code actually works, the docs in that package reference the papers that inspired it:
The basic algorithm is described in "An O(ND) Difference Algorithm and its Variations", Eugene W. Myers, 'Algorithmica' Vol. 1 No. 2, 1986, pp. 251-266; and in "A File Comparison Program", Webb Miller and Eugene W. Myers, 'Software--Practice and Experience' Vol. 15 No. 11, 1985, pp. 1025-1040. The algorithm was independently discovered as described in "Algorithms for Approximate String Matching", E. Ukkonen, 'Information and Control' Vol. 64, 1985, pp. 100-118.
Reading the papers then looking at the source code for an implementation should be more than enough to understand how it works.

See https://github.com/google/diff-match-patch
"The Diff Match and Patch libraries
offer robust algorithms to perform the
operations required for synchronizing
plain text. ... Currently available
in Java, JavaScript, C++, C# and
Python"
Also see the Wikipedia Diff page and "Bram Cohen: The diff problem has been solved".

I came here looking for the diff algorithm and afterwards made my own implementation. Sorry I don't know about vcdiff.
Wikipedia: From a longest common subsequence it's only a small step to get diff-like output: if an item is absent in the subsequence but present in the original, it must have been deleted. (The '–' marks, below.) If it is absent in the subsequence but present in the second sequence, it must have been added in. (The '+' marks.)
Nice animation of the LCS algorithm here.
Link to a fast LCS ruby implementation here.
My slow and simple ruby adaptation is below.
def lcs(xs, ys)
  if xs.count > 0 and ys.count > 0
    xe, *xb = xs
    ye, *yb = ys
    if xe == ye
      return [xe] + lcs(xb, yb)
    end
    a = lcs(xs, yb)
    b = lcs(xb, ys)
    return (a.length > b.length) ? a : b
  end
  return []
end

def find_diffs(original, modified, subsequence)
  result = []
  while subsequence.length > 0
    sfirst, *subsequence = subsequence
    while modified.length > 0
      mfirst, *modified = modified
      break if mfirst == sfirst
      result << "+#{mfirst}"
    end
    while original.length > 0
      ofirst, *original = original
      break if ofirst == sfirst
      result << "-#{ofirst}"
    end
    result << "#{sfirst}"
  end
  while modified.length > 0
    mfirst, *modified = modified
    result << "+#{mfirst}"
  end
  while original.length > 0
    ofirst, *original = original
    result << "-#{ofirst}"
  end
  return result
end

def pretty_diff(original, modified)
  subsequence = lcs(modified, original)
  diffs = find_diffs(original, modified, subsequence)
  puts 'ORIG [' + original.join(', ') + ']'
  puts 'MODIFIED [' + modified.join(', ') + ']'
  puts 'LCS [' + subsequence.join(', ') + ']'
  puts 'DIFFS [' + diffs.join(', ') + ']'
end

pretty_diff("human".scan(/./), "chimpanzee".scan(/./))
# ORIG [h, u, m, a, n]
# MODIFIED [c, h, i, m, p, a, n, z, e, e]
# LCS [h, m, a, n]
# DIFFS [+c, h, +i, -u, m, +p, a, n, +z, +e, +e]
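The recursive `lcs` above takes exponential time in the worst case, because it recomputes the same suffix pairs over and over. A table-based variant avoids that. The sketch below is in Python rather than Ruby and is not taken from the linked implementations; it is the textbook dynamic-programming formulation, and the name `lcs_table` is my own.

```python
def lcs_table(xs, ys):
    """Longest common subsequence of two sequences in O(len(xs) * len(ys)) time."""
    n, m = len(xs), len(ys)
    # length[i][j] is the LCS length of the suffixes xs[i:] and ys[j:].
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i] == ys[j]:
                length[i][j] = 1 + length[i + 1][j + 1]
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    result, i, j = [], 0, 0
    while i < n and j < m:
        if xs[i] == ys[j]:
            result.append(xs[i])
            i, j = i + 1, j + 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


if __name__ == "__main__":
    print(lcs_table(list("human"), list("chimpanzee")))
```

When several subsequences share the maximum length, which one comes back depends on the tie-breaking in the walk, so the result can differ from the recursive version while still being a valid input for `find_diffs`.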

Based on the link Emmelaich gave, there is also a great run down of Diff Strategies on Neil Fraser's website (one of the authors of the library).
He covers basic strategies and towards the end of the article progresses to Myers' algorithm and some graph theory.

Related

Algorithm design manual solution to 1-8

I'm currently reading through The Algorithm Design Manual by Steven S. Skiena. Some of the concepts in the book I haven't used in almost 7 years. Even while I was in college it was difficult for me to understand how some of my classmates came up with some of these proofs. Now, I'm completely stuck on one of the exercises. Please help.
Will you please answer this question and explain how you came up with what to use for your Base case and why each step proves why it is valid and correct. I know this might be asking a lot, but I really need help understanding how to do these.
Thank you in advance!
Proofs of Correctness
Question:
1-8. Prove the correctness of the following algorithm for evaluating a polynomial.
$$P(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
function horner(A, x)
    p = A_n
    for i from n-1 to 0
        p = p*x + A_i
    return p
btw, off topic: Sorry guys, I'm not sure how to correctly add the mathematical formatting for the formula. I tried by adding '$' around each section. Not sure why that isn't working.

https://cs.stackexchange.com/ is probably better for this. Also I'm pretty sure that $$ formatting only works on some StackExchange sites. But anyways, think about what this algorithm is doing at each step.
We start with p = A_n.
Then we take p = p*x + A_{n-1}. So what is this doing? We now have p = x*A_n + A_{n-1}.
I'll try one more step. p = p*x + A_{n-2} so now p = (x^2)*A_n + x*A_{n-1} + A_{n-2} (here x^2 means x to the power 2, of course).
You should be able to take it from here.
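The pattern in those steps suggests a loop invariant: after the step for index i, p equals A_n*x^(n-i) + ... + A_{i+1}*x + A_i. It holds initially with i = n (the base case p = A_n), each step preserves it, and at i = 0 it is exactly P(x). A small Python sketch that checks this invariant numerically follows; the list layout (`coeffs[i]` holds A_i) and the function names are assumptions made for the illustration, and a check like this is a sanity test, not a substitute for the induction proof.

```python
def horner(coeffs, x):
    """Evaluate a_n*x^n + ... + a_1*x + a_0, where coeffs[i] holds a_i."""
    n = len(coeffs) - 1
    p = coeffs[n]
    for i in range(n - 1, -1, -1):
        p = p * x + coeffs[i]
        # Invariant: p == sum of coeffs[j] * x**(j - i) for j = i .. n.
        # Exact for integer inputs; with floats it would hold only approximately.
        assert p == sum(coeffs[j] * x ** (j - i) for j in range(i, n + 1))
    return p


def naive(coeffs, x):
    """Direct evaluation of the polynomial, used only for comparison."""
    return sum(a * x ** i for i, a in enumerate(coeffs))


if __name__ == "__main__":
    print(horner([1, 2, 3], 2), naive([1, 2, 3], 2))
```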

Hashing algorithms for data summary

I am on the search for a non-cryptographic hashing algorithm with a given set of properties, but I do not know how to describe it in Google-able terms.
Problem space: I have a vector of 64-bit integers which are mostly linearly distributed throughout that space. There are two exceptions to this rule: (1) The number 0 occurs considerably frequently and (2) if a number x occurs, it is more likely to occur again than 2^-64. The goal is, given two vectors A and B, to have a convenient mechanism for quickly detecting if A and B are not the same. Not all vectors are of fixed size, but any vector I wish to compare to another will have the same size (aka: a size check is trivial).
The only special requirement I have is I would like the ability to "back out" a piece of data. In other words, given A[i] = x and a hash(A), it should be cheap to compute hash(A) for A[i] = y. In other words, I want a non-cryptographic hash.
The most reasonable thing I have come up with is this (in Python-ish):
# Imagine this uses a Mersenne Twister or some other seeded RNG...
NUMS = generate_numbers(seed)
def hash(a):
    out = 0
    for idx in range(len(a)):
        out ^= a[idx] ^ NUMS[idx]
    return out

def hash_replace(orig_hash, idx, orig_val, new_val):
    return orig_hash ^ (orig_val ^ NUMS[idx]) ^ (new_val ^ NUMS[idx])
It is an exceedingly simple algorithm and it probably works okay. However, all my experience with writing hashing algorithms tells me somebody else has already solved this problem in a better way.
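For anyone who wants to experiment with the idea, here is a runnable version of the same scheme. The seed, the table size of 1024, and the renaming of `hash` to `vector_hash` (to avoid shadowing the Python built-in) are assumptions added for the sketch; Python's `random.Random` is a Mersenne Twister.

```python
import random


def generate_numbers(seed, count):
    """Return `count` reproducible 64-bit values from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(count)]


NUMS = generate_numbers(seed=12345, count=1024)  # assumed maximum vector length


def vector_hash(a):
    """XOR every value together with its per-position constant."""
    out = 0
    for idx, value in enumerate(a):
        out ^= value ^ NUMS[idx]
    return out


def hash_replace(orig_hash, idx, orig_val, new_val):
    """Update the hash in O(1) when a[idx] changes from orig_val to new_val."""
    return orig_hash ^ (orig_val ^ NUMS[idx]) ^ (new_val ^ NUMS[idx])


if __name__ == "__main__":
    a = [0, 7, 0, 2 ** 63]
    b = [0, 99, 0, 2 ** 63]
    print(hash_replace(vector_hash(a), 1, 7, 99) == vector_hash(b))
    print(vector_hash([1, 2]) == vector_hash([2, 1]))
```

Running it shows one property worth knowing: since XOR is commutative, the `NUMS[idx]` terms combine into a single constant for a given length, so two vectors holding the same values in a different order hash identically.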

I think what you are looking for is called a homomorphic hashing algorithm; it has already been discussed in connection with the Paillier cryptosystem.
As far as I can see from that discussion, there are no practical implementations at present.
The most interesting feature, the one for which I guess it fits your needs, is that:
H(x*y) = H(x)*H(y)
Because of that, you can freely define the lower limit of your unit and rely on that property.
I used the Paillier cryptosystem a few years ago during my studies (there was a Java implementation somewhere, but I no longer have the link), but it is far more complex than what you are looking for.
It has interesting features under certain constraints, such as the following one:
n*C(x) = C(n*x)
Again, it looks to me similar to what you are looking for, so maybe you should search for this family of hashing algorithms. I'll have a try with Google searching for a more specific link.
References:
This one is quite interesting, but it may not be a viable solution because your space is [0, 2^64) (unless you accept dealing with big numbers).

Diffing many texts against each other to derive template and data (finding common subsequences)

Suppose there are many texts that are known to be made from a single template (for example, many HTML pages, rendered from a template backed by data from some sort of database). A very simple example:
id:937 name=alice;
id:28 name=bob;
id:925931 name=charlie;
Given only these 3 texts, I'd like to get original template that looks like this:
"id:" + $1 + " name=" + $2 + ";"
and 3 sets of strings that were used with this template:
$1 = 937, $2 = alice
$1 = 28, $2 = bob
$1 = 925931, $2 = charlie
In other words, "template" is a list of the common subsequences encountered in all given texts always in a certain order and everything else except these subsequences should be considered "data".
I guess the general algorithm would be very similar to any LCS (longest common subsequence) algorithm, albeit with different backtracking code, that would somehow separate "template" (characters common for all given texts) and "data strings" (different characters).
Bonus question: are there ready-made solutions to do so?

I agree with the comments about the question being ill-defined. It seems likely that the format is much more specific than your general question indicates.
Having said that, something like RecordBreaker might be a help. You could also Google "wrapper induction" to see if you find some useful leads.

Perform a global multiple sequence alignment, and then call every resulting column that has a constant value part of the template:
                   id:   937 name=alice  ;
                   id:    28 name=bob    ;
                   id:925931 name=charlie;
Inferred template: XXX      XXXXXX       X
Most tools that I'm aware of for multiple sequence alignment require smaller alphabets -- DNA or protein -- but hopefully you can find a tool that works on the alphabet you're using (which presumably is at least all printable ASCII characters). In the worst case, you can of course implement the DP yourself: to align 2 sequences (strings) globally you use the Needleman-Wunsch algorithm, while for more than two sequences there are several approaches, the most common being sum-of-pairs scoring. The exact algorithm for k > 2 sequences unfortunately takes time exponential in k, but the heuristics employed in bioinformatics tools such as MUSCLE are much faster, and produce alignments that are very nearly as good. If they can be persuaded to work with the alphabet you're using, they would be the natural choice.
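If no alignment tool accepts your alphabet, a much cruder stand-in can be sketched with Python's standard library. The code below is not a multiple sequence alignment: it folds the texts together pairwise using `difflib.SequenceMatcher`, which finds matching blocks heuristically and does not guarantee a longest common subsequence. It also assumes that the separator character never occurs in the input and that every text starts and ends with template text. All function names are my own.

```python
import re
from difflib import SequenceMatcher

SEPARATOR = "\x00"  # assumption: this character never occurs in the texts


def common_pieces(template, text):
    """Return the substrings of `template` that also occur, in order, in `text`."""
    matcher = SequenceMatcher(None, template, text, autojunk=False)
    return [template[i:i + size]
            for i, _, size in matcher.get_matching_blocks() if size > 0]


def infer_template(texts):
    """Fold the texts together one at a time, keeping only the shared pieces."""
    pieces = [texts[0]]
    for text in texts[1:]:
        # Joining with SEPARATOR stops a match from spanning two pieces.
        pieces = common_pieces(SEPARATOR.join(pieces), text)
    return pieces


def extract_data(pieces, text):
    """Return the strings found between consecutive template pieces, or None."""
    pattern = "(.*?)".join(re.escape(piece) for piece in pieces)
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return match.groups() if match else None


if __name__ == "__main__":
    texts = ["id:937 name=alice;", "id:28 name=bob;", "id:925931 name=charlie;"]
    template = infer_template(texts)
    print(template)
    for text in texts:
        print(extract_data(template, text))
```

On the three example texts from the question this yields the pieces `id:`, ` name=` and `;`. With real data the result depends on the order in which texts are folded in, and data values that happen to share characters can leak into the template, which is the kind of problem a proper multiple alignment handles better.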

Iterative solving for unknowns in a fluids problem

I am a Mechanical engineer with a computer scientist question. This is an example of what the equations I'm working with are like:
x = √((y-z)×2/r)
z = f×(L/D)×(x/2g)
f = something crazy with x in it
etc…(there are more equations with x in it)
The situation is this:
I need r to find x, but I need x to find z. I also need x to find f which is a part of finding z. So I guess a value for x, and then I use that value to find r and f. Then I go back and use the value I found for r and f to find x. I keep doing this until the guess and the calculated are the same.
My question is:
How do I get the computer to do this? I've been using mathcad, but an example in another language like C++ is fine.

The very first thing you should do faced with iterative algorithms is write down on paper the sequence that will result from your idea:
Eg.:
x_0 = ..., f_0 = ..., r_0 = ...
x_1 = ..., f_1 = ..., r_1 = ...
...
x_n = ..., f_n = ..., r_n = ...
Now, you have an idea of what you should implement (even if you don't know how). If you don't manage to find a closed form expression for one of the x_i, r_i or whatever_i, you will need to solve one dimensional equations numerically. This will imply more work.
Now, for the implementation part, if you have never written a program, you should seriously ask someone in person who can help you (or hire an intern to write the code). We cannot help you start from scratch with, e.g., C programming, but we are willing to help you with specific problems that arise when you write the program.
Please note that your algorithm is not guaranteed to converge, even if you strongly think there is a unique solution. Solving non linear equations is a difficult subject.

It appears that mathcad has many abstractions for iterative algorithms without the need to actually implement them directly using a "lower level" language. Perhaps this question is better suited for the mathcad forums at:
http://communities.ptc.com/index.jspa

If you are using Mathcad, it has the functionality built in. It is called solve block.
Start with the keyword "given"
Given
define the guess values for all unknowns
x:=2
f:=3
r:=2
...
define your constraints
x = √((y-z)×2/r)
z = f×(L/D)×(x/2g)
f = something crazy with x in it
etc…(there are more equations with x in it)
calculate the solution
find(x, y, z, r, ...)=
Check Mathcad help or Quicksheets for examples of the exact syntax.

The simple answer to your question is this pseudo-code:
X = startingX;
lastF = Infinity;
F = 0;
tolerance = 1e-10;
while ((lastF - F)^2 > tolerance)
{
    lastF = F;
    X = ?;
    R = ?;
    F = FunctionOf(X,R);
}
This may not do what you expect at all. It may give a valid but nonsense answer or it may loop endlessly between alternate wrong answers.
This is standard substitution to convergence. There are more advanced techniques like DIIS but I'm not sure you want to go there. I found this article while figuring out if I want to go there.
In general, it really pays to think about how you can transform your problem into an easier problem.
In my experience it is better to pose your problem as a univariate bounded root-finding problem and use Brent's Method if you can.
Next worst option is multivariate minimization with something like BFGS.
Iterative solutions are horrible, but are more easily solved once you think of them as X2 = f(X1) where X is the input vector and you're trying to reduce the difference between X1 and X2.
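The X2 = f(X1) view can be sketched in a few lines of Python. The version below handles a single unknown and adds the two safeguards that the pseudo-code lacks: an iteration cap and a check for non-finite values. The update function used in the demo (`math.cos`) is only a stand-in, since the real equations from the question are not fully specified; `fixed_point` and its parameters are names chosen for this illustration.

```python
import math


def fixed_point(g, x0, tolerance=1e-10, max_iterations=1000):
    """Iterate x <- g(x) until successive values differ by at most `tolerance`.

    Returns (x, iterations). Raises if the iteration blows up or does not
    settle within `max_iterations`, instead of looping forever.
    """
    x = x0
    for iteration in range(1, max_iterations + 1):
        x_next = g(x)
        if not math.isfinite(x_next):
            raise ArithmeticError(f"iteration diverged at step {iteration}")
        if abs(x_next - x) <= tolerance:
            return x_next, iteration
        x = x_next
    raise RuntimeError(f"no convergence after {max_iterations} iterations")


if __name__ == "__main__":
    # Stand-in problem: solve x = cos(x).
    root, steps = fixed_point(math.cos, 1.0)
    print(root, steps)
```

A small difference between successive iterates does not prove the answer is right, so it is worth substituting the result back into the original equations as a final check.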

As the commenters have noted, the mathematical aspects of your question are beyond the scope of the help you can expect here, and are even beyond the help you could be offered based on the detail you posted.
However, I think that even if you understood the mathematics thoroughly there are computer science aspects to your question that should be addressed.
When you write your code, try to organize it into functions that depend only upon the parameters you pass in to them. So write a subroutine that takes in values for y, z, and r and returns x. Write another that takes in f, L, D, and g and returns z. Now you have testable routines that you can check to make sure they are computing correctly. Check the input values inside the routines themselves: for instance, when computing x you will get a divide-by-zero error if you pass in 0 for r. Think about how you want to handle this.
If you are going to solve this problem iteratively you will need a method that decides, based on the results of one iteration, what the values for the next iteration will be. This should also be encapsulated within a subroutine. If you are using a language that allows only one value to be returned from a subroutine (which is the case for most common computation languages: C, C++, Java, C#), you need to package all your variables into some kind of data structure to return them. You could use an array of reals or doubles, but it would be nicer to make an object, so that you can reference the variables by name and not by position (less chance of error).
Another aspect of iteration is knowing when to stop. Certainly you'll do so when you get a solution that converges. Make this decision into another subroutine. Now when you need to change the convergence criteria there is only one place in the code to go to. But you need to consider other reasons for stopping - what do you do if your solution starts diverging instead of converging? How many iterations will you allow the run to go before giving up?
Another aspect of iteration on a computer is round-off error. Mathematically 10^40/10^38 is 100. Mathematically 10^20 + 1 > 10^20. These statements do not always hold in floating-point arithmetic. Your calculations may need to take this into account or you will end up with numbers that are garbage. This is an example of a cross-cutting concern that does not lend itself to encapsulation in a subroutine.
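A short Python sketch makes the round-off point concrete and shows one common way to write a convergence test that tolerates it. The helper name `converged` and its default tolerances are assumptions chosen for the example, not recommended values for any particular problem.

```python
import math
from fractions import Fraction


def converged(previous, current, rel_tol=1e-9, abs_tol=1e-12):
    """Convergence test that tolerates round-off instead of demanding equality."""
    return math.isclose(previous, current, rel_tol=rel_tol, abs_tol=abs_tol)


if __name__ == "__main__":
    big = 1e20
    print(big + 1 > big)                   # False: the 1 is lost to round-off
    exact = Fraction(10) ** 20
    print(exact + 1 > exact)               # True: exact rational arithmetic
    print(0.1 + 0.2 == 0.3)                # False: neither side is exact in binary
    print(converged(0.1 + 0.2, 0.3))       # True
```

Exact types such as `Fraction` avoid the problem but are slow and cannot represent square roots, so for engineering calculations a tolerance-based comparison is usually the practical choice.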
I would suggest that you go look at the Python language, and the pythonxy.com extensions. There are people in the associated forums that would be a good resource for helping you learn how to do iterative solving of a system of equations.

Standards for pseudo code? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I need to translate some python and java routines into pseudo code for my master thesis but have trouble coming up with a syntax/style that is:
consistent
easy to understand
not too verbose
not too close to natural language
not too close to some concrete programming language.
How do you write pseudo code? Are there any standard recommendations?

I recommend looking at the "Introduction to Algorithms" book (by Cormen, Leiserson and Rivest). I've always found its pseudo-code description of algorithms very clear and consistent.
An example:
DIJKSTRA(G, w, s)
1  INITIALIZE-SINGLE-SOURCE(G, s)
2  S ← Ø
3  Q ← V[G]
4  while Q ≠ Ø
5      do u ← EXTRACT-MIN(Q)
6         S ← S ∪ {u}
7         for each vertex v ∈ Adj[u]
8             do RELAX(u, v, w)

Answering my own question, I just wanted to draw attention to the TeX FAQ entry Typesetting pseudocode in LaTeX. It describes a number of different styles, listing advantages and drawbacks. Incidentally, there happen to exist two stylesheets for writing pseudo code in the manner used in "Introduction to Algorithms" by Cormen, as recommended above: newalg and clrscode. The latter was written by Cormen himself.

I suggest you take a look at the Fortress Programming Language.
This is an actual programming language, and not pseudocode, but it was designed to be as close to executable pseudocode as possible. In particular, for designing the syntax, they read and analyzed hundreds of CS and math papers, courses, books and journals to find common usage patterns for pseudocode and other computational/mathematical notations.
You can leverage all that research by just looking at Fortress source code and abstracting out the things you don't need, since your target audience is human, whereas Fortress's is a compiler.
Here is an actual example of running Fortress code from the NAS (NASA Advanced Supercomputing) Conjugate Gradient Parallel Benchmark. For a fun experience, compare the specification of the benchmark with the implementation in Fortress and notice how there is almost a 1:1 correspondence. Also compare the implementation in a couple of other languages, like C or Fortran, and notice how they have absolutely nothing to do with the specification (and are also often an order of magnitude longer than the spec).
I must stress: this is not pseudocode, this is actual working Fortress code! From https://umbilicus.wordpress.com/2009/10/16/fortress-parallel-by-default/
Note that Fortress is written in ASCII characters; the special characters are rendered with a formatter.

If the code is procedural, normal pseudo-code is probably easy (Wikipedia has some examples).
Object-oriented pseudo-code might be more difficult. Consider:
using UML class diagrams to depict the classes/inheritance
using UML sequence diagrams to depict the sequence of code

I don't understand your requirement of "not too close to some concrete programming language".
Python is generally considered as a good candidate for writing pseudo-code. Perhaps a slightly simplified version of python would work for you.
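As an illustration of that suggestion, the DIJKSTRA routine quoted in another answer might look like this in plain Python. The graph representation (a dict mapping each vertex to a dict of neighbour weights) and the use of `heapq` as the priority queue are choices made for this sketch, not something the pseudo-code prescribes.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest distances from `source`; graph[u][v] is the weight of edge u -> v.

    Every vertex must appear as a key of `graph`, and weights must be non-negative.
    """
    distance = {vertex: math.inf for vertex in graph}   # INITIALIZE-SINGLE-SOURCE
    distance[source] = 0
    finished = set()                                    # the set S
    queue = [(0, source)]                               # the queue Q
    while queue:
        dist_u, u = heapq.heappop(queue)                # EXTRACT-MIN
        if u in finished:
            continue                                    # stale queue entry
        finished.add(u)
        for v, weight in graph[u].items():
            if dist_u + weight < distance[v]:           # RELAX
                distance[v] = dist_u + weight
                heapq.heappush(queue, (distance[v], v))
    return distance


if __name__ == "__main__":
    graph = {"s": {"a": 1, "b": 4}, "a": {"b": 2, "c": 6}, "b": {"c": 3}, "c": {}}
    print(dijkstra(graph, "s"))
```

The comparison also shows the cost: details such as stale queue entries are exactly the implementation noise that pseudo-code is meant to hide.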

Pascal has traditionally been the language most similar to pseudocode in mathematical and technical fields. I don't know why; it was just always so.
I have some books on my shelf (ten, maybe) that support this theory.
Python, as suggested, can make for nice code, but it can also be remarkably unreadable. Older languages are harder to make unreadable because they are "simpler" (take that with caution) than today's. It may be harder to understand what is going on in them, but they are easier to read, since fewer syntax and language features are needed to understand what the program does.

This post is old, but hopefully this will help others.
"Introduction to Algorithms" (by Cormen, Leiserson and Rivest) is a good book to read about algorithms, but its "pseudo-code" is terrible. Notation like Q[1...n] is nonsense when the reader first has to work out what Q[1...n] is supposed to mean, which then has to be explained outside of the "pseudo-code." Moreover, books like "Introduction to Algorithms" like to use a mathematical syntax, which violates one purpose of pseudo-code.
Pseudo-code should do two things: abstract away from syntax and be easy to read. If actual code is more descriptive and easier to read than the pseudo-code, then it is not pseudo-code.
Say you were writing a simple program.
Screen design:
Welcome to the Consumer Discount Program!
Please enter the customers subtotal: 9999.99
The customer receives a 10 percent discount
The customer receives a 20 percent discount
The customer does not receive a discount
The customer's total is: 9999.99
Variable List:
TOTAL: double
SUB_TOTAL: double
DISCOUNT: double
Pseudo-code:
DISCOUNT_PROGRAM
    Print "Welcome to the Consumer Discount Program!"
    Print "Please enter the customers subtotal:"
    Input SUB_TOTAL
    Select the case for SUB_TOTAL
        SUB_TOTAL > 10000 AND SUB_TOTAL <= 50000
            DISCOUNT = 0.1
            Print "The customer receives a 10 percent discount"
        SUB_TOTAL > 50000
            DISCOUNT = 0.2
            Print "The customer receives a 20 percent discount"
        Otherwise
            DISCOUNT = 0
            Print "The customer does not receive a discount"
    TOTAL = SUB_TOTAL - (SUB_TOTAL * DISCOUNT)
    Print "The customer's total is:", TOTAL
Notice that this is very easy to read and does not reference any syntax. This supports all three of Bohm and Jacopini's control structures.
Sequence:
Print "Some stuff"
VALUE = 2 + 1
SOME_FUNCTION(SOME_VARIABLE)
Selection:
if condition
    Do one extra thing
if condition
    do one extra thing
else
    do one extra thing
if condition
    do one extra thing
else if condition
    do one extra thing
else
    do one extra thing
Select the case for SYSTEM_NAME
    condition 1
        statement 1
    condition 2
        statement 2
    condition 3
        statement 3
    otherwise
        statement 4
Repetition:
while condition
    do stuff
for SOME_VALUE TO ANOTHER_VALUE
    do stuff
Compare that to this N-Queens "pseudo-code" (https://en.wikipedia.org/wiki/Eight_queens_puzzle):
PlaceQueens(Q[1 .. n], r)
    if r = n + 1
        print Q
    else
        for j ← 1 to n
            legal ← True
            for i ← 1 to r − 1
                if (Q[i] = j) or (Q[i] = j + r − i) or (Q[i] = j − r + i)
                    legal ← False
            if legal
                Q[r] ← j
                PlaceQueens(Q[1 .. n], r + 1)
If you can't explain it simply, you don't understand it well enough.
- Albert Einstein
