Minimal Difference Patch Algorithm - algorithm

I'm trying to convey the difference between two bytestreams. I want to minimize the number of bytes in the patch.
(I don't necessarily want to minimize the number of "changes" in the diff, which is what the optimal patch in a levenshtein distance computation would give me.)
The patch would ideally be in a format such that, given the source bytestream and the diff, it would be easy to reconstruct the target bytestream.
Is there a good algorithm for doing this?
Edit: For the record, I've tried sending changes of the form "at spot 506, insert the following the bytes...", where I create a change list from the levenshtein distance algorithm.
The problem I have is that the levenshtein distance algorithm gives me a lot of changes like:
at spot 506 substitute [some bytes1]
at spot 507 do nothing
at spot 508 substitute [some bytes2]
at spot 509 do nothing
at spot 510 substitute [some bytes3]
...
This is because the lev distance algorithm tries to minimize the number of changes. However, for my purposes this instruction set is wasteful. It would probably be better if an algorithm just said,
At spot 506 substitute [some bytes1, [byte at spot 507], some bytes2, [byte at spot 509], some bytes3, ...]
There's probably some way to modify lev distance to favor these types of changes but it seems a little tricky. I could coalesce substituions after getting a changelist (and I'm going to try that) but there may be opportunities to coalesce deletions / inserts too, and it's less obvious how to do that correctly.
Just wondering if there's a special purpose algorithm for this (or if somebody's done a modification of lev distance to favor these types of changes already).

You can do this using pairwise alignment with affine gap costs, which takes O(nm) time for two strings of lengths n and m respectively.
One thing first: There is no way to find a provably minimal patch in terms of bits or bytes used. That's because if there was such a way, then the function shortest_patch(x, y) that calculates it could be used to find a provably minimal compression of any given string s by calling it with shortest_patch('', s), and Kolmogorov complexity tells us that the shortest possible compression of a given string is formally uncomputable. But if edits tend to be clustered in space, as it seems they are here, then it's certainly possible to find smaller patches than those produced using the usual Levenshtein distance algorithm.
Edit scripts
Patches are usually called "edit scripts" in CS. Finding a minimal (in terms of number of insertions plus number of deletions) edit script for turning one string x into another string y is equivalent to finding an optimal pairwise alignment in which every pair of equal characters has value 0, every pair of unequal characters has value -inf, and every position in which a character from one string is aligned with a - gap character has value -1. Alignments are easy to visualise:
st--ing st-i-ng
stro-ng str-ong
These are 2 optimal alignments of the strings sting and strong, each having cost -3 under the model. If pairs of unequal characters are given the value -1 instead of -inf, then we get an alignment with cost equal to the Levenshtein distance (the number of insertions, plus the number of deletions, plus the number of substitutions):
st-ing sti-ng
strong strong
These are 2 optimal alignments under the new model, and each has cost -2.
To see how these correspond with edit scripts, we can regard the top string as the "original" string, and the bottom string as the "target" string. Columns containing pairs of unequal characters correspond to substitutions, the columns containing a - in the top row correspond to insertions of characters, and the columns containing a - in the bottom row correspond to deletions of characters. You can create an edit script from an alignment by using the "instructions" (C)opy, (D)elete, (I)nsert and (S)ubstitute. Each instruction is followed by a number indicating the number of columns to consume from the alignment, and in the case of I and S, a corresponding number of characters to insert or replace with. For example, the edit scripts for the previous 2 alignments are
C2, I1"r", S1"o", C2 and C2, S1"r", I1"o", C2
Increasing bunching
Now if we have strings like mississippi and tip, we find that the two alignments
mississippi
------tip--
mississippi
t---i----p-
both have the same score of -9: they both require the same total number of insertions, deletions and substitutions. But we much prefer the top one, because its edit script can be described much more succinctly: D6, S1"t", C2, D2. The second's edit script would be S1"t", D3, C1, D4, C1, D1.
In order to get the alignment algorithm to also "prefer" the first alignment, we can adjust gap costs so that starting a blocks of gaps costs more than continuing an existing block of gaps. If we make it so that a column containing a gap costs -2 instead of -1 when the preceding column contains no gap, then what we are effectively doing is penalising the number of contiguous blocks of gaps (since each contiguous block of gaps must obviously have a first position). Under this model, the first alignment above now costs -11, because it contains two contiguous blocks of gaps. The second alignment now costs -12, because it contains three contiguous blocks of gaps. IOW, the algorithm now prefers the first alignment.
This model, in which every aligned position containing a gap costs g and the first position in any contiguous block of gap columns costs g + s, is called the affine gap cost model, and an O(nm) algorithm was given for this by Gotoh in 1982: http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/archive/JMB82.pdf. Increasing the gap-open cost s will cause aligned segments to bunch together. You can play with the various cost parameters until you get alignments (corresponding to patches) that empirically look about right and are small enough.

There are two approaches to solving this kind of problem:
1) Establish a language for X (edit scripts, in this case), and figure out how to minimize the length of the applicable sentence; or,
2) Compute some kind of minimum representation for Y (string differences), and then think up a way to represent that in the shortest form.
The Myers paper demonstrates that for a particular language, finding the minimum set of changes and finding the minimum length of the change representation are the same problem.
Obviously, changing the language might invalidate that assumption, and certain changes might be extremely complicated to apply correctly (for example, suppose the language included the primitive kP which means to remove the next k characters whose indices are prime. For certain diffs, using that primitive might turn out to be a huge win, but the applications are probably pretty rare. It's an absurd example, I know, but it demonstrates the difficulty of starting with a language.
So I propose starting with the minimum change list, which identifies inserts and deletes. We translate that in a straightforward way to a string of commands, of which there are exactly three. There are no indices here. The idea is that we start with a cursor at the beginning of the original string, and then execute the commands in sequence. The commands are:
= Advance the cursor without altering the character it points to
Ic Insert the character `c` before the cursor.
D Delete the character at the cursor.
Although I said there were exactly three commands, that's not quite true; there are actually A+2 where A is the size of the alphabet.
This might result in a string like this:
=========================IbIaInIaInIaDD=D=D============================
Now, let's try to compress this. First, we run-length encode (RLE), so that every command is preceded by a repeat count, and we drop the trailing =s
27=1Ib1Ia1In1Ia1In1Ia2D1=1D1=1D
(In effect, the RLE recreates indices, although they're relative instead of absolute).
Finally, we use zlib to compress the resulting string. I'm not going to do that here, but just to give some idea of the sort of compression it might come up with:
27=1Ib1Ia1In||2D1=1D|
______+| ____+
___<---+
(Trying to show the back-references. It's not very good ascii art, sorry.)
Liv-Zempell is very good at finding and optimizing unexpected repetitions. In fact, we could have just used it instead of doing the intermediate RLE step, but experience shows that in cases where RLE is highly effective, it's better to LZ the RLE than the source. But it would be worth trying both ways to see what's better for your application.

A common approach to this that uses very few bytes (though not necessarily the theoretical optimal number of bytes) is the following:
Pad the bytes with some character (perhaps zero) until they have the same lengths.
XOR the two streams together. This will result in a byte stream that is zero everywhere the bytes are the same and nonzero otherwise.
Compress the XORed stream using any compression algorithm, perhaps something like LZW.
Assuming that the patch you have is a localized set of changes to a small part of the file, this will result in a very short patch, since the bulk of the file will be zeros, which can be efficiently compressed.
To apply the patch, you just decompress the XORed string and then XOR it with the byte stream to patch. This computes
Original XOR (Original XOR New) = (Original XOR Original) XOR New = New
Since XOR is associative and self-inverting.
Hope this helps!

There is a new promising approach to change detection.
The sequence alignment problem is considered to be an abstract model for changes detection in collaborative text editing designed to minimize the probability of merge conflict. A new cost function is defined as the probability of intersection between detected changes and random string.
The result should be more similar to patch length minimization then other known approaches.
It avoids both the known shortcomings of LCS and others approaches.
The cubic algorithm has been proposed.
http://psta.psiras.ru/read/psta2015_1_3-10.pdf

Related

Repeated DNA sequence

The problem is to find out all the sequences of length k in a given DNA sequence which occur more than once. I found a approach of using a rolling hash function, where for each sequence of length k, hash is computed and is stored in a map. To check if the current sequence is a repetition, we compute it's hash and check if the hash already exist in the hash map. If yes, then we include this sequence in our result, otherwise add it to the hash map.
Rolling hash here means, when moving on to the next sequence by sliding the window by one, we use the hash of previous sequence in a way that we remove the contribution of the first character of previous sequence and add the contribution of the newly added char i.e. the last character of the new sequence.
Input: AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT
and k=10
Answer: {AAAAACCCCC, CCCCCAAAAA}
This algorithm looks perfect, but I can't go about making a perfect hash function so that collisions are avoided. It would be a great help if somebody can explain how to make a perfect hash under any circumstance and most importantly in this case.
This is actually a research problem.
Let's come to terms with some facts
Input = N, Input length = |N|
You have to move a size k, here k=10, sliding window over the input. Therefore you must live with O(|N|) or more.
Your rolling hash is a form of locality sensitive deterministic hashing, the downside of deterministic hashing is the benefit of hashing is greatly diminished as the more often you encounter similar strings the harder it will be to hash
The longer your input the less effective hashing will be
Given these facts "rolling hashes" will soon fail. You cannot design a rolling hash that will even work for 1/10th of a chromosome.
SO what alternatives do you have?
Bloom Filters. They are much more robust than simple hashing. The downside is sometimes they have a false positives. But this can be mitigated by using several filters.
Cuckoo Hashes similar to bloom filters, but use less memory and have locality sensitive "hashing" and worst case constant lookup time
Just stick every suffix in a suffix trie. Once this is done, just output every string at depth 10 that also has atleast 2 children with one of the children being a leaf.
Improve on the suffix trie with a suffix tree. Lookup is not as straightforward but memory consumption is less.
My favorite the FM-Index. In my opinion the cleanest solution uses the Burrows Wheeler Transform. This technique is also used in industryu tools like Bowtie and BWA
Heads-up: This is not a general solution, but a good trick that you can use when k is not large.
The trick is to encrypt the sequence into an integer by bit manipulation.
If your input k is relatively small, let's say around 10. Then you can encrypt your DNA sequence in an int via bit manipulation. Since for each character in the sequence, there are only 4 possibilities, A, C, G, T. You can simply make your own mapping which uses 2 bits to represent a letter.
For example: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
In this way, if k is 10, you won't need a string with 10 characters as hash key. Instead, you can only use 20 bits in an integer to represent the previous key string.
Then when you do your rolling hash, you left shift the integer that stores your previous sequence for 2 bits, then use any bit operations like |= to set the last two bits with your new character. And remember to clear the 2 left most bits that you just shifted, meaning you are removing them from your sliding window.
By doing this, a string could be stored in an integer, and using that integer as hash key might be nicer and cheaper in terms of the complexity of the hash function computation. If your input length k is slightly longer than 16, you may be able to use a long value. Otherwise, you might be able to use a bitset or a bitarray. But to hash them becomes another issue.
Therefore, I'd say this solution is a nice attempt for this problem when the sequence length is relatively small, i.e. can be stored in a single integer or long integer.
You can build the suffix array and the LCP array. Iterate through the LCP array, every time you see a value greater or equal to k, report the string referred to by that position (using the suffix array to determine where the substring comes from).
After you report a substring because the LCP was greater or equal to k, ignore all following values until reaching one that is less than k (this avoids reporting repeated values).
The construction of both, the suffix array and the LCP, can be done in linear time. So overall the solution is linear with respect to the size of the input plus output.
What you could do is use Chinese Remainder Theorem and pick several large prime moduli. If you recall, CRT means that a system of congruences with coprime moduli has a unique solution mod the product of all your moduli. So if you have three moduli 10^6+3, 10^6+33, and 10^6+37, then in effect you have a modulus of size 10^18 more or less. With a sufficiently large modulus, you can more or less disregard the idea of a collision happening at all---as my instructor so beautifully put it, it's more likely that your computer will spontaneously catch fire than a collision to happen, since you can drive that collision probability to be as arbitrarily small as you like.

Algorithm for computing transitive closures on DAGs by harnessing machine words?

Let G be DAG with n vertices and m edges given by adjacency matrix. I need to calculate it's closure in form of a matrix as well. We have a computer that each word is b bits. and I need to find an algorithm that calculate the transitive closure in (n^2+nm/b)
I'm not really sure I understand what bits means and how can I use it
Adding the algorithm for finding transitive closure of dag:
TransitiveForDAG (Graph G)
int T[1...n,1...n] ={0,...,0}
List L <- TopologicalSort(G)
For each v in reverse(L)
T[v,v]<-1
For each u in Adj[v]
for j<-1,...,n do
T[v,j]<-T[v,j] or T[u,j]
You say you don't know what bits mean, so let's start with that.
Bit is the smallest unit of digital information - a 0 or a 1
Word is the unit of data processed by a computer at once. Processors don't take and process individual bits, but small chunks of them. Most today's computer architectures use words of 32 or 64 bits.
Now, how to work with words of binary data? In most programming languages, you'll use a numeric data type to store the data. To manipulate them, most languages provide bitwise operators - bitwise or (|) is one needed here.
So, how to make your algorithm faster? Look at the T matrix. It can only have values of 0 or 1 - a single bit is enough to store that information. You're processing fields of the matrix one by one; every time you process the last line of your algorithm, you only use one bit from the v'th row and one from the u'th row.
As said before, the processor has to read a whole word to read and process each of those bits. That's ineffective; mostly you wouldn't care about such a detail, but here it's in a critical place - the last line is in the innermost cycle and will be executed very many times.
Now, how to be more effective? Change the way you store the data - use a type with the length of a word as the data type for your matrix. Store b values from your original matrix in every value of the new one - this is called packing. Because of the way your algorithm works, you'll want to store them by row - the first value in the i-th row will contain first b values from the i-th row of the original matrix.
Apart from that, you only have to change the innermost cycle of the algorithm - the cycle will iterate over the words instead of individual fields and inside you'll process the whole words at once using bitwise or
T[v,j]<-T[v,j] | T[u,j]
The cycle is what generates the time complexity of the algorithm. You've now managed to iterate b-times less, so the complexity becomes (n^2+nm/b)
For a simple graph, each entry in the adjacency matrix is either 0 or 1 and can be represented by one bit. This allows a compact representation of the matrix by packing b entries into each computer word. The challenge then is to implement matrix multiplication (to compute the closure) using the correct bit manipulation operators.

Get most unique text from a group of text

I have a number of texts, for example 100.
I would keep the 10 most unique among them. I made a 100x100 matrix where I compared each text among them with the Levenshtein algorithm.
Is there an algorithm to select the 10 most unique?
EDIT :
What i want is the N most unique text that maximize the distance between this N text regardless of the 1st element of my set.
I want the most unique because i will publish these text to the web and i want avoid near duplicate.
A long comment rather than an answer ...
I don't think you've specified your requirement(s) clearly enough. How do you select the 1st element of your set of 10 strings ? Is it the string with the largest distance from any other string (in which case you are looking for the largest element in your array) or the one with the largest distance from all the other strings (in which case you are looking for the largest row- or column-sum in the array).
Moving on to the N (or 10 as you suggest) most distant strings, you have a number of choices.
You could select the N largest distances in the array. I suspect, not having seen your data, that it is likely that the string which is furthest from any other string may also be furthest away from several other strings too -- I mean you may find that several of the N largest entries in your array occur in the same row or column.
You could simply select the N strings with the largest row sums.
Or perhaps you are looking for a cluster of N strings which maximises the distance between all the strings in that cluster and all the strings in the remaining 100-N strings. This might lead you towards looking at, rather obviously, clustering algorithms.
I suggest you clarify your requirements and edit your question.
Since this looks like an eigenvalue problem, I would try to execute the Power iteration on the matrix, and reject the 90 highest values from the resulting vector. The power iteration normally converges very fast, within ~ten iterations. BTW: this solution assumes a similarity matrix. If the entries of your matrix are a measure of *dis*similarity ("distance"), you might need to use their inverses instead.

Generate random sequence of integers differing by 1 bit without repeats

I need to generate a (pseudo) random sequence of N bit integers, where successive integers differ from the previous by only 1 bit, and the sequence never repeats. I know a Gray code will generate non-repeating sequences with only 1 bit difference, and an LFSR will generate non-repeating random-like sequences, but I'm not sure how to combine these ideas to produce what I want.
Practically, N will be very large, say 1000. I want to randomly sample this large space of 2^1000 integers, but I need to generate something like a random walk because the application in mind can only hop from one number to the next by flipping one bit.
Use any random number generator algorithm to generate an integer between 1 and N (or 0 to N-1 depending on the language). Use the result to determine the index of the bit to flip.
In order to satisfy randomness you will need to store previously generated numbers (thanks ShreevatsaR). Additionally, you may run into a scenario where no non-repeating answers are possible so this will require a backtracking algorithm as well.
This makes me think of fractals - following a boundary in a julia set or something along those lines.
If N is 1000, use a 2^500 x 2^500 fractal bitmap (obviously don't generate it in advance - you can derive each pixel on demand, and most won't be needed). Each pixel move is one pixel up, down, left or right following the boundary line between pixels, like a simple bitmap tracing algorithm. So long as you start at the edge of the bitmap, you should return to the edge of the bitmap sooner or later - following a specific "colour" boundary should always give a closed curve with no self-crossings, if you look at the unbounded version of that fractal.
The x and y axes of the bitmap will need "Gray coded" co-ordinates, of course - a bit like oversized Karnaugh maps. Each step in the tracing (one pixel up, down, left or right) equates to a single-bit change in one bitmap co-ordinate, and therefore in one bit of the resulting values in the random walk.
EDIT
I just realised there's a problem. The more wrinkly the boundary, the more likely you are in the tracing to hit a point where you have a choice of directions, such as...
* | .
---+---
. | *
Whichever direction you enter this point, you have a choice of three ways out. Choose the wrong one of the other two and you may return back to this point, therefore this is a possible self-crossing point and possible repeat. You can eliminate the continue-in-the-same-direction choice - whichever way you turn should keep the same boundary colours to the left and right of your boundary path as you trace - but this still leaves a choice of two directions.
I think the problem can be eliminated by making having at least three colours in the fractal, and by always keeping the same colour to one particular side (relative to the trace direction) of the boundary. There may be an "as long as the fractal isn't too wrinkly" proviso, though.
The last resort fix is to keep a record of points where this choice was available. If you return to the same point, backtrack and take the other alternative.
While an algorithm like this:
seed()
i = random(0, n)
repeat:
i ^= >> (i % bitlen)
yield i
…would return a random sequence of integers differing each by 1 bit, it would require a huge array for backtracing to ensure uniqueness of numbers.
Further more your running time would increase exponentially(?) with increasing density of your backtrace, as the chance to hit a new and non-repeating number decreases with every number in the sequence.
To reduce time and space one could try to incorporate one of these:
Bloom Filter
Use a Bloom Filter to drastically reduce the space (and time) needed for uniqueness-backtracing.
As Bloom Filters come with the drawback of producing false positives from time to time a certain rate of falsely detected repeats (sic!) (which thus are skipped) in your sequence would occur.
While the use of a Bloom Filter would reduce the space and time your running time would still increase exponentially(?)…
Hilbert Curve
A Hilbert Curve represents a non-repeating (kind of pseudo-random) walk on a quadratic plane (or in a cube) with each step being of length 1.
Using a Hilbert Curve (on an appropriate distribution of values) one might be able to get rid of the need for a backtrace entirely.
To enable your sequence to get a seed you'd generate n (n being the dimension of your plane/cube/hypercube) random numbers between 0 and s (s being the length of your plane's/cube's/hypercube's sides).
Not only would a Hilbert Curve remove the need for a backtrace, it would also make the sequencer run in O(1) per number (in contrast to the use of a backtrace, which would make your running time increase exponentially(?) over time…)
To seed your sequence you'd wrap-shift your n-dimensional distribution by random displacements in each of its n dimension.
Ps: You might get better answers here: CSTheory # StackExchange (or not, see comments)

How to find the best possible answer to a really large seeming problem?

First off, this is NOT a homework problem. I haven't had to do homework since 1988!
I have a list of words of length N
I have a max of 13 characters to choose from.
There can be multiples of the same letter
Given the list of words, which 13 characters would spell the most possible words. I can throw out words that make the problem harder to solve, for example:
speedometer has 4 e's in it, something MOST words don't have,
so I could toss that word due to a poor fit characteristic, or it might just
go away based on the algorithm
I've looked # letter distributions, I've built a graph of the words (letter by letter). There is something I'm missing, or this problem is a lot harder than I thought. I'd rather not totally brute force it if that is possible, but I'm down to about that point right now.
Genetic algorithms come to mind, but I've never tried them before....
Seems like I need a way to score each letter based upon its association with other letters in the words it is in....
It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possible with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-complete optimization problem in general (assuming possibly alphabet i.e. set of letters that contains more than 26 items) by reduction of SETCOVER to it, but I'm leaving the actual reduction as an exercise to the reader :)
Assuming it's hard, you have the usual routes:
branch and bound
stochastic search
approximation algorithms
Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of
Letters you've already used (with multiplicity)
Number of characters you still get to use
Letters still available
Words still in your list
Number of words still in your list (count of the previous set)
Number of words that are not possible in this state
Number of words that are already covered by your choice of letters
You'd start with
Empty set
13
{A, B, ..., Z}
Your whole list
N
0
0
Put that data structure into a queue.
At each step
Pop an item from the queue
Split into possible next states (branch)
Bound & delete extraneous possibilities
From a state, I'd generate possible next states as follows:
For each letter L in the set of letters left
Generate a new state where:
you've added L to the list of chosen letters
the least letter is L
so you remove anything less than L from the allowed letters
So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
Do all the various accounting to figure out the new states.
Each state has at minimum "Number of words that are already covered by your choice of letters" words, and at maximum that number plus "Number of words still in your list." Of all the states, find the highest minimum, and delete any states with maximum higher than that.
No special handling for speedometer required.
I can't imagine this would be fast, but it'd work.
There are probably some optimizations (e.g., store each word in your list as an array of A-Z of number of occurrances, and combine words with the same structure: 2 occurrances of AB.....T => BAT and TAB). How you sort and keep track of minimum and maximum can also probably help things somewhat. Probably not enough to make an asymptotic difference, but maybe for a problem this big enough to make it run in a reasonable time instead of an extreme time.
Total brute forcing should work, although the implementation would become quite confusing.
Instead of throwing words like speedometer out, can't you generate the association graphs considering only if the character appears in the word or not (irrespective of the no. of times it appears as it should not have any bearing on the final best-choice of 13 characters). And this would also make it fractionally simpler than total brute force.
Comments welcome. :)
Removing the bounds on each parameter including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632 . It's fixed-parameter tractable in the alphabet size by brute force.
I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is only about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
I did not understand this completely "I have a max of 13 characters to choose from.". If you have a list of 1000 words, then did you mean you have to reduce that to just 13 chars?!
Some thoughts based on my (mis)understanding:
If you are only handling English lang words, then you can skip vowels because consonants are just as descriptive. Our brains can sort of fill in the vowels - a.k.a SMS/Twitter language :)
Perhaps for 1-3 letter words, stripping off vowels would loose too much info. But still:
spdmtr hs 4 's n t, smthng
MST wrds dn't hv, s cld
tss tht wrd d t pr ft
chrctrstc, r t mght jst g
wy bsd n th lgrthm
Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....

Resources