How can I merge overlapping text without overlapping content - data-structures

Suppose you are sending a snapshot of a document to a friend. If the document is long, you take two snapshots, but some parts are repeated: the end of snapshot1 and the beginning of snapshot2 are the same.
Assumption: the snapshots are text data and we can compare them line by line.
E.g. snapshot1 has 3 lines - [AAA,BBB,CCC] - and snapshot2 has 4 lines - [BBB,CCC,DDD,EEE]
Final result - [AAA,BBB,CCC,DDD,EEE]
The brute-force method of comparing each element is obvious - is there any data structure we could use to reduce the complexity?
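Beyond the brute-force scan, here is a minimal Python sketch (the function names are mine, and it assumes the only duplication is a suffix of snapshot1 that matches a prefix of snapshot2): the longest such overlap can be found with the KMP failure function computed over snapshot2 + separator + snapshot1, which makes the merge linear in the number of lines.

    def overlap_length(s1, s2):
        # Length of the longest prefix of s2 that is also a suffix of s1,
        # via the KMP failure function over s2 + separator + s1.
        combined = list(s2) + [None] + list(s1)   # None can never equal a text line
        fail = [0] * len(combined)
        for i in range(1, len(combined)):
            k = fail[i - 1]
            while k and combined[i] != combined[k]:
                k = fail[k - 1]
            if combined[i] == combined[k]:
                k += 1
            fail[i] = k
        return fail[-1]

    def merge_snapshots(s1, s2):
        k = overlap_length(s1, s2)
        return list(s1) + list(s2)[k:]

    print(merge_snapshots(["AAA", "BBB", "CCC"], ["BBB", "CCC", "DDD", "EEE"]))
    # ['AAA', 'BBB', 'CCC', 'DDD', 'EEE']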

Related

Searching for ordered series in a file

I have 2 files where each line contains an ordered series of id/value pairs of different sizes, as follows:
(id,value):(id,value):...
(1,2):(3,0):(60,3):....
Each line/series is considered a coverage point. File one is a list of all the coverage points that I need to find (this file could be big, 5,000,000+ lines). This file should never change, as it is a master list of all the points I need to cover. The second file is a run report that is generated by my program.
What I have to do now is write a script that takes the report file and, for every line/coverage point in it, searches the master file to see if there is an exact line that matches the coverage point. I need to find the line number so I can record the number of times I hit that coverage point.
First option: I go through each line in the report file and compare it to each line of the master file.
Second option: I have some way of initially sorting the master file so it is easier to search.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Fourth option: I use some sort of data structure initially loaded with the master file:
Linked list
Tree structure
Database
I think there are many ways of doing this, but I want to do it efficiently and without being too complex. If there are other approaches I can't think of, let me know. Any guidance at this point would be great.
Here are a few points as examples:
(53,0):
(53,1):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,0):(56,1):(57,1):
(53,2):(54,0):(55,0):(56,1):(57,0):
(53,1):(54,0):(55,1):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,0):
(53,1):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,2):(59,1):(59,0):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,1):
(53,2):(54,0):(55,0):(56,1):(57,1):
(53,1):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,1):(59,1):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,1):
(53,1):(54,0):(55,2):(59,1):(59,0):(60,2):
(53,2):(54,0):(55,0):(56,1):(57,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,1):(59,1):(60,2):
(53,1):(54,0):(55,3):(59,1):(59,0):(59,1):(60,1):
(53,2):(54,0):(55,3):(59,1):(59,0):(59,1):(60,2):
(53,2):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,0):
(53,2):(54,0):(55,6):(59,1):(59,0):(59,1):(59,0):(59,1):(59,0):(60,0):
(53,1):(54,0):(55,5):(59,1):(59,0):(59,1):(59,0):(59,1):(60,1):
Let m = number of lines in the master-file
Let n = number of lines in the report-file
For 5,000,000+ lines it is most likely infeasible to load it all into memory, so you have to cope with slow file I/O.
Some Comments:
First option: I go through each line in the report file and compare it to each line of the master file
This would be extremely slow (O(m*n)), and all of it is very slow file I/O.
If you can load the whole report-file into RAM, you could do so and then read the master-file line by line, searching for each line in the report-file (in RAM).
Still O(m*n), but you read both files only once.
Second option: I have some way of initially sorting the master file so it is easier to search
This would only be an advantage if you find a way to search the file at random line positions. [Without extra effort this is only possible if all lines have exactly the same size.] Searching would then speed up from O(m) to O(log m).
So overall performance would still be O((log m)*n), and all of it is very slow file I/O.
Third option: I make some sort of "hash function" that would take a line and give it a unique ID.
Implementing a hash function working on files is possible but kind of tricky. You could split up the master-file into a lot of files, where each file contains all lines corresponding to the same hash value (which is also the file name).
Fourth option: I use some sort of data structure initially loaded with the master file
My Answer:
You cannot keep the whole master-file in RAM (most likely).
If your report-file is much smaller, you can keep it in RAM.
Approach 1 (quite simple):
Make a HashSet of the report-file in RAM, then iterate over the master-file (once) and check whether each line is in your HashSet (see the sketch below).
Takes O(m) time [+ O(n) to create the HashSet]
Approach 2: split the master-file into smaller files matching their hash values:
Load your report-file and store it in a HashSet. Then, for each hash value in your report-file hash set, look in the corresponding master-file-hash-value part to check for a match.
Takes O(n) time [+ O(n) to create the HashSet + O(m) to create the master-file hash files]
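A minimal sketch of Approach 1 in Python (a Counter/dict in place of Java's HashSet; "report.txt" and "master.txt" are placeholder names). It also records the master line number and the hit count the question asks for:

    from collections import Counter

    # Build the in-RAM lookup structure from the (smaller) report file:
    # coverage point -> number of times it was hit in the run.
    with open("report.txt") as report:
        hits = Counter(line.strip() for line in report)

    # Stream the master file once; it never has to fit into RAM.
    with open("master.txt") as master:
        for line_no, line in enumerate(master, start=1):
            point = line.strip()
            if point in hits:
                print(line_no, hits[point])   # master line number and hit count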

Kalman filter, multiple lines tracking

I have a problem with multiple lines tracking by using Kalman filter.
Input data: the number of items and a set of structures with x1, y1, x2, y2 (coordinates). For each iteration the number of items can differ, so some lines can appear or disappear.
For a single line it looks simple: we have the input data, the equations, etc., and we can create the output. We always know the line can exist; if it does not, and it appears later, it will still be the same line.
But for multiple lines I don't know how to start. In one iteration I can get a few objects - OK, I will use this set of equations for each of them. But in the next iteration I can get fewer or more lines. I'm not sure what the correct approach is - I have data from the previous iteration, but I need to associate it with the same object. So:
1. I need to find it - by checking the distance between midpoints for each previously estimated track <-> detected line pair and choosing the smallest value? Is that a correct approach, or is there a different method?
2. Storing old data - a line was visible for a long time but after the next iteration will never appear again. I get a new line and again the same situation. It would be good to store old results, but in that case, after a long time, I will have a lot of zombie data. Is there some standard criterion to clean it up, or do I need to use my own ideas like a maximum number of iterations with no detection, etc.?
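A rough Python sketch of the two ideas above - greedy nearest-neighbour association by midpoint distance, plus a miss counter to prune stale tracks. The Track class, the gate distance and the miss limit are all invented for illustration; in a real tracker the assignment step would feed the Kalman update.

    import math

    MAX_MISSES = 5        # drop a track after this many iterations without a detection
    GATE_DISTANCE = 20.0  # ignore candidate matches farther away than this

    class Track:
        def __init__(self, line):
            self.line = line      # (x1, y1, x2, y2); stands in for the Kalman state
            self.misses = 0

        def midpoint(self):
            x1, y1, x2, y2 = self.line
            return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    def update_tracks(tracks, detections):
        """Greedy nearest-neighbour assignment of detected lines to tracks."""
        unmatched = list(detections)
        for track in tracks:
            tx, ty = track.midpoint()
            best, best_d = None, GATE_DISTANCE
            for det in unmatched:
                mx, my = (det[0] + det[2]) / 2.0, (det[1] + det[3]) / 2.0
                d = math.hypot(mx - tx, my - ty)
                if d < best_d:
                    best, best_d = det, d
            if best is not None:
                unmatched.remove(best)
                track.line = best     # a real tracker would run the Kalman update here
                track.misses = 0
            else:
                track.misses += 1     # no detection for this track this iteration
        kept = [t for t in tracks if t.misses <= MAX_MISSES]   # prune zombie tracks
        kept.extend(Track(det) for det in unmatched)           # new lines open new tracks
        return kept

    tracks = []
    tracks = update_tracks(tracks, [(0, 0, 10, 0), (5, 5, 15, 5)])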

Find Successive words in Article B available from A

There are two articles, A and B, which are very large. Take three or more successive words from A, check whether they appear in B, and count how many times they appear. For example, if 'book', 'his' and 'her' appear consecutively in A, how many times does that sequence appear in B?
I thought about splitting the entire content of B and then checking each set of 3 words from A with StringToken, but I am not sure about the algorithm's efficiency.
Look into what a Hashtable is. Scan your file B for words one by one (you can split the whole content if you don't care about memory usage on large files), inserting each word you find into the hashtable (when it is not already there) or incrementing the count of how many times the word has been seen.
Then you just scan over A, looking at each set of 3 words with a rolling sliding window. This way you can increase the length of the window later without rewriting anything.
For reference, you should really tag homework questions as such.
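A hedged Python reading of this suggestion, with one adaptation: the hashtable here counts 3-word sequences of B rather than single words, so the window over A can be looked up directly. The file names are placeholders.

    from collections import Counter

    WINDOW = 3   # length of the rolling window; can be increased later

    def windows(words, n=WINDOW):
        # Every run of n consecutive words, as a tuple so it can be a dict key.
        return (tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    with open("B.txt") as f:
        b_counts = Counter(windows(f.read().split()))   # 3-word sequence -> count in B

    with open("A.txt") as f:
        a_words = f.read().split()

    for seq in set(windows(a_words)):                   # each distinct 3-word run in A
        if b_counts[seq]:
            print(" ".join(seq), "appears", b_counts[seq], "times in B")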
It is obvious that you need to scan/parse the entire content of B once to reach the results. You just cannot avoid doing that. Read it line by line; for every line, search for the given query terms and count them in the line. Keep adding the counts generated on a per-line basis to get the final result.
If you want to do such a computation many times on the content of B for the same or different terms, creating an Inverted_index for B would be the best way.
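For the repeated-query case, a minimal positional inverted index for B might look like this sketch (whitespace tokenization and the file name are assumptions):

    from collections import defaultdict

    def build_index(words):
        # word -> list of positions at which it occurs in B
        index = defaultdict(list)
        for pos, word in enumerate(words):
            index[word].append(pos)
        return index

    def phrase_count(index, phrase):
        """Count occurrences of a phrase by intersecting shifted position lists."""
        if not phrase:
            return 0
        starts = set(index.get(phrase[0], []))
        for offset, word in enumerate(phrase[1:], start=1):
            starts &= {p - offset for p in index.get(word, [])}
        return len(starts)

    b_index = build_index(open("B.txt").read().split())
    print(phrase_count(b_index, ["book", "his", "her"]))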

need help designing a search algorithm in a more efficient way

I have a problem in the area of biology. Right now I have 4 VERY LARGE files (each with 0.1 billion lines), but the structure is rather simple: each line of these files has only 2 fields, both standing for a type of gene.
My goal is to design an efficient algorithm that achieves the following:
Find a circle within the contents of these 4 files. The circle is defined as:
field #1 in a line in file 1 == field #1 in a line in file 2 and
field #2 in a line in file 2 == field #1 in a line in file 3 and
field #2 in a line in file 3 == field #1 in a line in file 4 and
field #2 in a line in file 4 == field #2 in a line in file 1
I cannot think of a decent way to solve this, so I just wrote a stupid brute-force 4-layer nested loop for now. I'm thinking about sorting them in alphabetical order, even if that might help a little, but then it's also obvious that the computer's memory would not allow me to load everything at once. Can anybody tell me a good way to solve this problem in a way that is both time and space efficient? Thanks!!
First of all, I note that you can sort a file without holding it in memory all at once, and that most operating systems have some program that does this, often called just "sort". Usually you can get it to sort on a field within a file, but if not you can rewrite each line to get it to sort the way you want.
Given this, you can connect two files by sorting them so that the first is sorted on field #1 and the second on field #2. You can then create one record for each match, combining all the fields, and only holding in memory a chunk from each file where all the fields you have sorted on have the same value. This will allow you to connect the result with another file - four such connections should solve your problem.
Depending on your data, the time it takes to solve your problem may depend on the order in which you make the connections. One rather naive way to make use of this is, at each stage, to take a small random sample from each file, and use this to see how many results will follow from each possible connection, and choose the connection that produces the fewest results. One way to take a random sample of N items from a large file is to take the first N lines in the file and then, when you have read in m lines so far, read the next line, and then with probability N/(m + 1) exchange one of the N lines held for it, else throw it away. Keep on until you have read through the whole file.
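The sampling scheme described here is reservoir sampling; a short Python sketch, with the file path and sample size as placeholders:

    import random

    def sample_lines(path, n):
        # Keep a uniform random sample of n lines from a file of unknown length,
        # reading it only once (reservoir sampling).
        sample = []
        with open(path) as f:
            for m, line in enumerate(f):                # m lines were read before this one
                if m < n:
                    sample.append(line)
                elif random.random() < n / (m + 1.0):   # keep with probability n/(m+1)
                    sample[random.randrange(n)] = line
        return sample

    print(sample_lines("huge_file.txt", 100))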
Here is one algorithm:
Select an appropriate lookup structure: if field #1 is an integer, use bit-fields; if it is a string, use a dictionary (or a set). Use a lookup structure for each file, i.e. 4 in your case.
Initialization phase: for each file, parse the file line by line and set the appropriate bit in the bit-field, or add the field to the dictionary, in the corresponding lookup structure for that file.
After initializing the lookup structures above, check the condition in your question.
The complexity of this depends on the lookup structure implementation. For bit-fields it will be O(1), and for a set or dictionary it will be O(lg(n)), since they are usually implemented as a balanced search tree. The complete complexity will be O(n) or O(n lg(n)); your solution in the question has a complexity of O(n^4).
You can get the code and solution for bit fields from here
HTH
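An in-memory Python sketch of this idea, using dictionaries keyed on field #1 in place of bit-fields. It assumes whitespace-separated fields and that the four maps fit in RAM (at 0.1 billion lines each that may not hold without the bit-field trick or an on-disk structure); the file names are placeholders.

    from collections import defaultdict

    def load(path):
        # field #1 -> set of field #2 values seen on the same line
        index = defaultdict(set)
        with open(path) as f:
            for line in f:
                a, b = line.split()
                index[a].add(b)
        return index

    f1, f2, f3, f4 = (load(p) for p in ("file1.txt", "file2.txt", "file3.txt", "file4.txt"))

    for a1, b1_values in f1.items():            # line (a1, b1) in file 1
        for b2 in f2.get(a1, ()):               # field #1 in file 2 == a1
            for b3 in f3.get(b2, ()):           # field #1 in file 3 == field #2 in file 2
                for b4 in f4.get(b3, ()):       # field #1 in file 4 == field #2 in file 3
                    if b4 in b1_values:         # field #2 in file 4 == field #2 in file 1
                        print("circle:", a1, b2, b3, b4)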
Here is one approach:
We will use the notation Fxy, where x = field number and y = file number.
Sort each of the 4 files on the first field.
For each value of F11, find a match in file 2. This will be linear. Save these matches, with all four fields, to a new file. Now use the corresponding field in this new file to get all the matches from file 3. Continue for file 4 and back to file 1.
In this way, as you progress to each new file, you are dealing with a smaller number of lines. And since you have sorted the files, the search is linear and can be done by reading from disk.
Here the complexity is O(n log n) for sorting and O(m log n) for lookup, assuming m << n.
It's a bit easier to explain if your File 1 is the other way around (so each second element points to a first element in the next file).
1. Start with File 1; copy it to a new file, writing each A, B pair as B, A, 'REV'.
2. Append the contents of File 2 to it, writing each A, B pair as A, B, 'FWD'.
3. Sort the file.
4. Process the file in chunks with the same initial value.
5. Within that chunk, group the lines into REVs and FWDs.
6. Take the Cartesian product of the REVs and the FWDs (nested loop).
7. Write a line with reverse(fwd) concat (rev), excluding the repeated token, e.g. B, A, 'REV' and B, C, 'FWD' -> C, B, A, 'REV'.
8. Append the next file to this new output file (adding 'FWD' to each line).
9. Repeat from step 3.
In essence you are building up a chain in reverse order and using a file-based sort algorithm to put sequences together that can be combined.
Of course it would be even easier to just read these files into a database and let it do the work ...
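As a hedged illustration of the database route, a small sqlite3 sketch (table and file names are invented, and data of this size would need real indexing and disk space rather than a toy setup):

    import sqlite3

    conn = sqlite3.connect("genes.db")
    cur = conn.cursor()
    for i in (1, 2, 3, 4):
        cur.execute(f"CREATE TABLE IF NOT EXISTS f{i} (a TEXT, b TEXT)")
        with open(f"file{i}.txt") as f:                       # placeholder file names
            cur.executemany(f"INSERT INTO f{i} VALUES (?, ?)",
                            (line.split() for line in f))
        cur.execute(f"CREATE INDEX IF NOT EXISTS idx_f{i}_a ON f{i}(a)")
    conn.commit()

    # The "circle" condition from the question, expressed as a 4-way join.
    cur.execute("""
        SELECT f1.a, f2.b, f3.b, f4.b
        FROM f1
        JOIN f2 ON f2.a = f1.a
        JOIN f3 ON f3.a = f2.b
        JOIN f4 ON f4.a = f3.b
        WHERE f4.b = f1.b
    """)
    for row in cur.fetchmany(10):   # peek at the first few circles
        print(row)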

Seeking algo for text diff that detects and can group similar lines

I am in the process of writing a diff text tool to compare two similar source code files.
There are many such "diff" tools around, but mine shall be a little improved:
If it finds that a set of lines is mismatched on both sides (i.e. in both files), it shall not only highlight those lines but also highlight the individual changes in those lines (I call this inter-line comparison here).
An example of my somewhat working solution:
(Screenshot: http://files.tempel.org/tmp/diff_example.png)
What it currently does is take a set of mismatched lines and run their individual characters through the diff algorithm once more, producing the pink highlighting.
However, the second set of mismatches, containing "original 2", requires more work: here, the first two right lines ("added line a/b") were added, while the third line is an altered version of the left side. I want my software to detect this difference between a likely alteration and a probable new line.
When looking at this simple example, I can rather easily detect this case:
With an algorithm such as Levenshtein distance, I could find that of all the right lines in the set of 3 to 5, line 5 matches left line 3 best; thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of different lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:
I'd have to match not only the first line on the left to the best one on the right, but vice versa as well, and so on with all other lines. Basically, I have to match every line on the left against every one on the right. At worst, this might even create crossings, so that it's no longer easily clear which lines were newly inserted and which were just altered (note: I do not want to deal with possibly moved lines in such a block, unless that would actually simplify the algorithm).
Sure, this is never going to be perfect, but I'm trying to make it better than it is now. Any suggestions that aren't too theoretical but rather practical (I'm not good at understanding abstract algorithms) are appreciated.
Update
I must admit that I do not even understand how the LCS algorithm works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff
Looking at the code, I find one function, equal(), that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that's the place where I'd make the changes. But how? This function only returns a boolean - not a relative value that could identify the quality of the match. And I cannot simply use a fixed Levenshtein ratio to decide whether a similar line is still considered equal or not - I'll need something that adapts itself to the entire set of lines in question.
So, what I'm basically saying is that I still do not understand where I'd apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.
Levenshtein distance is based on the notion of an "edit script" that transforms one string into another. It's very closely related to the Needleman-Wunsch algorithm used for aligning DNA sequences by inserting gap characters, in which we search for the alignment that maximises a score in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gap characters reduce the score. An example alignment of AACTTGCCA and AATGCGAT:
AACTTGCCA-
AA-T-GCGAT
(6 matches, 1 mismatch, 3 gap characters, 3 gap regions)
We can think of the top string being the "starting" sequence that we are transforming into the "final" sequence on the bottom. Each column containing a - gap character on the bottom is a deletion, each column with a - on the top is an insertion, and each column with different (non-gap) characters is a substitution. There are 2 deletions, 1 insertion and 1 substitution in the above alignment, so the Levenshtein distance is 4.
Here is another alignment of the same strings, with the same Levenshtein distance:
AACTTGCCA-
AA--TGCGAT
(6 matches, 1 mismatch, 3 gap characters, 2 gap regions)
But notice that although there are the same number of gaps, there is one less gap region. Because biological processes are more likely to create wide gaps than multiple separate gaps, biologists prefer this alignment -- and so will the users of your program. This is accomplished by also penalising the number of gap regions in the scores that we compute. An O(nm) algorithm to accomplish this for strings of lengths n and m was given by Gotoh in 1982 in a paper called "An improved algorithm for matching biological sequences". Unfortunately, I can't find any links to free full text of the paper -- but there are many useful tutorials that you can find by googling "sequence alignment" and "affine gap penalty".
In general, different choices of match, mismatch, gap and gap region weights will give different alignments, but any negative score for gap regions will prefer the bottom alignment above to the top one.
What does all this have to do with your problem? If you use Gotoh's algorithm on individual characters with a suitable gap penalty (arrived at with a few empirical tests), you should find a significant decrease in the number of terrible-looking alignments like the example you gave.
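For concreteness, a minimal Needleman-Wunsch scoring sketch in Python with a simple linear gap penalty; Gotoh's affine-gap variant adds two more matrices but follows the same dynamic-programming pattern, and the weights below are arbitrary examples rather than tuned values.

    MATCH, MISMATCH, GAP = 2, -1, -2   # example weights, not tuned

    def align_score(a, b):
        n, m = len(a), len(b)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * GAP                  # leading gaps in b
        for j in range(1, m + 1):
            score[0][j] = j * GAP                  # leading gaps in a
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
                score[i][j] = max(diag,                   # match or substitution
                                  score[i - 1][j] + GAP,  # gap in b (deletion)
                                  score[i][j - 1] + GAP)  # gap in a (insertion)
        return score[n][m]

    print(align_score("AACTTGCCA", "AATGCGAT"))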
Efficiency Considerations
Ideally, you could just do this on characters and ignore lines altogether, since the affine penalty will work to cluster changes into blocks spanning many lines wherever it can. But because of the higher running time, it may be more realistic to do a first pass on lines and then rerun the algorithm on characters, using as input all lines that are not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with inflated matching weight, which helps to ensure no "crossings" appear.
With an algorithm such as Levenshtein distance, I could find that of all the right lines in the set of 3 to 5, line 5 matches left line 3 best; thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
After you have determined that, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification: when you used the algorithm to match equal lines, the lines could either match or not match, so each comparison added either 0 or 1 to the cell of the table you used.
When comparing strings in one chunk some of them are "more equal" than others (ack. to Orwell). So they can add a real number from 0 to 1 to the cell when considering what sequence matches best so far.
To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this during the first pass of the Levenshtein algorithm). This will compute the length of the LCS, whose ratio to the average length of the two strings would be the metric value.
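A small sketch of such a 0-to-1 metric: LCS length divided by the average length of the two lines (the function names are mine):

    def lcs_length(a, b):
        # Classic LCS dynamic programming with a rolling row.
        prev = [0] * (len(b) + 1)
        for ca in a:
            cur = [0]
            for j, cb in enumerate(b, start=1):
                cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def line_similarity(a, b):
        if not a and not b:
            return 1.0
        return lcs_length(a, b) / ((len(a) + len(b)) / 2.0)

    print(line_similarity("original line 2", "altered line 2"))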
Or you can borrow the algorithm from one of the diff tools. For instance, vimdiff can highlight the matches you require.
Here's one possible solution someone else just made me realize:
My original approach was like this:
Split the text up into separate lines and use the LCS algorithm to determine where there are blocks of non-matching lines.
Use some smart algorithm (which this question is about) to figure out which of these lines closely match, i.e. to tell that these lines were modified between revisions.
Compare those closely matching lines line-by-line using LCS again, while marking the non-matching lines as entirely new.
While this would allow for a better visual display of changes when comparing source code revisions, I now found that a much simpler approach is usually sufficient. It works like this:
Same as above.
Take the right and left block of nonmatching lines, concatenate those lines, and tokenize them (either into language-specific tokens/words, or just into single characters)
Apply the LCS algorithm to the two arrays of tokens (a small sketch follows below).
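A hedged sketch of the last two steps, using Python's difflib.SequenceMatcher as a stand-in for the LCS step; the sample lines are invented.

    import difflib

    left_block  = ["original line 1", "original line 2"]
    right_block = ["added line a", "added line b", "altered line 2"]

    # Concatenate each side's block and tokenize into words (characters would work too).
    left_tokens  = " ".join(left_block).split()
    right_tokens = " ".join(right_block).split()

    matcher = difflib.SequenceMatcher(None, left_tokens, right_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        print(op, left_tokens[i1:i2], "->", right_tokens[j1:j2])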
Maybe those who replied to my original question assumed that I already knew to do this, but I had my focus so strongly on a per-line comparison that it did not occur to me to apply LCS to the set of lines by concatenating them, instead of processing them line by line.
So, while this approach will not provide change information as detailed as I originally intended, it still improves the results over what I started with yesterday when I wrote this question.
I'll leave this question open for a while longer - maybe someone else, reading all this, can still provide a complete answer (Pavel and random_hacker offered some suggestions, but it's not a complete solution yet - anyway, thank you for the helpful comments).
