How to find largest difference between two text files? - git-diff

I'm trying to debug two large execution logs of about 500,000 lines each. There are minor, irrelevant differences between them, but I only want to find the line with the biggest difference between them.
I've been using git diff, but I don't know what option I would use to do this.
If there is a better tool for this, please suggest it.
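As far as I know, git diff has no option to rank individual lines by how much they differ, so one approach is to post-process the two files yourself. Below is a minimal Python sketch of that idea: it aligns the two logs with the standard difflib module, scores every replaced line pair by similarity, and reports the worst pair. The file names are placeholders, and difflib can be slow on inputs of this size, so treat it as an illustration rather than a tuned solution.
    import difflib
    # Align the two logs, then find the replaced line pair with the lowest
    # similarity ratio, i.e. the "biggest" single-line difference.
    with open("log_a.txt") as fa, open("log_b.txt") as fb:
        a, b = fa.readlines(), fb.readlines()
    worst = (1.0, None, None)   # (similarity, line from A, line from B)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        for la, lb in zip(a[i1:i2], b[j1:j2]):
            ratio = difflib.SequenceMatcher(None, la, lb).ratio()
            if ratio < worst[0]:
                worst = (ratio, la, lb)
    if worst[1] is not None:
        print(f"Most different pair (similarity {worst[0]:.2f}):")
        print("A:", worst[1].rstrip())
        print("B:", worst[2].rstrip())
    else:
        print("No replaced lines found.")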

Related

Efficient file ordering for byte diffing?

I'm trying to find the 'best' way to order two lists of files so that a diff patch between them is small in general.
The way to do this without any other 'heuristics' that may fail easily (natural name order, parsing index files like cues to figure out natural sequential orders) seems to be to analyze the bytes of the files in both collections and figure out a sequence that minimizes the 'distance' between them.
This actually reminds me of Levenshtein distance applied to segments of the bytes in the files (possibly with a constraint that segments of the same file stay in order, to minimize permutations). Is there a library around that can figure this out for me? Notice that it's likely for the header or footer of files that are 'technically the same' to be different (e.g. a different dump format).
My main use case is to figure out the distance between two kinds of CD dumps. It's pretty normal for a CD dump to be segmented in different ways. I could just figure out their 'natural' order from the index files (cue, ccd, etc.), but why waste an opportunity to get something that applies generally (that works with extra files in the source or destination, with files segmented in different ways, or to compare things that aren't CD dumps)?
I'd prefer a Python library, if you know of any.
BTW, I already have something implemented (zxd3), but it pretty much uses the 'natural order' heuristic; I'd like to improve it (and make it work on more than two zips).
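For illustration, here is a rough Python sketch of that idea rather than a ready-made library: it samples a fixed-size chunk of bytes from each file, uses difflib's similarity ratio as a stand-in for a real Levenshtein/edit distance, and greedily pairs each source file with its closest destination file. The directory names and chunk size are made up, difflib on large byte samples is slow, and greedy matching is not optimal, so a real implementation would swap in a proper edit-distance or delta library and a better assignment step.
    import difflib
    from pathlib import Path
    CHUNK = 16 * 1024   # bytes sampled per file; trades accuracy for speed
    def sample(path: Path) -> bytes:
        # Headers/footers may differ between dump formats, so a fuller
        # version might also sample the middle of the file.
        with path.open("rb") as fh:
            return fh.read(CHUNK)
    def distance(a: bytes, b: bytes) -> float:
        # 0.0 = identical samples, 1.0 = nothing in common.
        return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()
    def pair_collections(src_dir: str, dst_dir: str):
        src = [p for p in sorted(Path(src_dir).iterdir()) if p.is_file()]
        dst = [p for p in sorted(Path(dst_dir).iterdir()) if p.is_file()]
        src_samples = {p: sample(p) for p in src}
        dst_samples = {p: sample(p) for p in dst}
        pairs, remaining = [], set(dst)
        for s in src:   # greedy nearest-neighbour matching
            best = min(remaining,
                       key=lambda d: distance(src_samples[s], dst_samples[d]),
                       default=None)
            if best is not None:
                pairs.append((s, best))
                remaining.discard(best)
        return pairs
    if __name__ == "__main__":
        for s, d in pair_collections("dump_a", "dump_b"):
            print(s.name, "->", d.name)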

How is duplicate file search implemented in Gemini for Mac OS?

I tried to search for duplicate files on my Mac via the command line.
The process took almost half an hour for 10 GB of data files, whereas apps like Gemini and CleanMyMac take much less time to find the files.
So my question is: how do these apps achieve this speed, what is the concept behind it, and in which language is the code written?
I tried googling for information but did not find anything related to duplicate finders.
If you have any ideas, please share them here.
First of all, Gemini locates files of equal size, then it uses its own hash-like, type-dependent algorithm to compare file contents. That algorithm is not 100% accurate, but it is much quicker than classical hashes.
I contacted support, asking them what algorithm they use. Their response was that they compare parts of each file to each other, rather than hashing or comparing the whole file. As a result, they only need to check maybe 5% (or less) of each pair of files that are reasonably similar in size, and still get a reasonably accurate result. Using this method, they don't have to pay the cost of comparing the whole file OR the cost of hashing files. They could be even more accurate if they used this method for the initial comparison and then did full comparisons among the potential matches.
Using this method, files that are minor variants of each other may be detected as identical. For example, I've had two songs (original mix and VIP mix) that counted as the same. I also had two images, one with a watermark and one without, listed as identical. In both these cases, the algorithm just happened to pick parts of the file that were identical across the two files.
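For what it's worth, here is a rough Python sketch of the approach described above (this is not Gemini's actual code): group files by size, then compare a few small samples from each candidate instead of hashing whole files, and treat matching samples as likely duplicates to be verified with a full comparison. The sample size, number of probes, and starting directory are arbitrary choices.
    from collections import defaultdict
    from pathlib import Path
    SAMPLE = 4096   # bytes per probe
    PROBES = 3      # beginning, middle, end of the file
    def samples(path: Path, size: int) -> bytes:
        # Read a few small chunks spread across the file.
        chunks = []
        with path.open("rb") as fh:
            for i in range(PROBES):
                offset = max(0, (size - SAMPLE) * i // max(1, PROBES - 1))
                fh.seek(offset)
                chunks.append(fh.read(SAMPLE))
        return b"".join(chunks)
    def candidate_duplicates(root: str):
        by_size = defaultdict(list)
        for path in Path(root).rglob("*"):
            if path.is_file():
                by_size[path.stat().st_size].append(path)
        for size, group in by_size.items():
            if len(group) < 2 or size == 0:
                continue
            by_sample = defaultdict(list)
            for path in group:
                by_sample[samples(path, size)].append(path)
            for dupes in by_sample.values():
                if len(dupes) > 1:
                    yield dupes   # likely duplicates; confirm with a full compare
    if __name__ == "__main__":
        for group in candidate_duplicates("."):
            print([str(p) for p in group])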

OCR error correction: How to combine three erroneous results to reduce errors

The problem
I am trying to improve the result of an OCR process by combining the output from three different OCR systems (tesseract, cuneiform, ocrad).
I already do image preprocessing (deskewing, despeckling, thresholding and some more). I don't think that this part can be improved much more.
Usually the text to recognize is between one and six words long. The language of the text is unknown, and quite often it contains fantasy words.
I am on Linux. Preferred language would be Python.
What I have so far
Often every result has one or two errors, but at different characters/positions. An error can be a wrongly recognized character or an extra character that does not exist in the text. Less often, a character is dropped entirely.
An example might look like this:
Xorem_ipsum
lorXYm_ipsum
lorem_ipuX
An X is a wrongly recognized character and a Y is a character which does not exist in the text. Spaces are replaced by "_" for better readability.
In cases like this I try to combine the different results.
By repeatedly applying the "longest common substring" algorithm between the three pairs, I am able to get the following structure for the given example:
or m_ipsum
lor m_ip u
orem_ip u
But here I am stuck. I am not able to combine those pieces into a result.
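For reference, the longest-common-substring step can be sketched with Python's difflib (an illustration, not the exact code used): find the longest common piece, keep it, and recurse on what lies to its left and right, collapsing stretches with no common material into a gap marker.
    import difflib
    def common_skeleton(a: str, b: str, gap: str = " ") -> str:
        # Keep the longest common substring, recurse on both sides of it,
        # and mark stretches with nothing in common with a gap character.
        if not a and not b:
            return ""
        if not a or not b:
            return gap
        m = difflib.SequenceMatcher(None, a, b, autojunk=False)
        match = m.find_longest_match(0, len(a), 0, len(b))
        if match.size == 0:
            return gap
        left = common_skeleton(a[:match.a], b[:match.b], gap)
        right = common_skeleton(a[match.a + match.size:], b[match.b + match.size:], gap)
        return left + a[match.a:match.a + match.size] + right
    results = ["Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"]
    print(common_skeleton(results[0], results[1]))   # roughly " or m_ipsum"
    print(common_skeleton(results[1], results[2]))   # roughly "lor m_ip u"
    print(common_skeleton(results[0], results[2]))   # roughly " orem_ip u"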
The questions
Do you have an idea how to combine the different longest common substrings? Or do you have a better idea how to solve this problem?
The quality of the results you can expect depends on the OCR engines you are using. You may find that choosing a higher-quality OCR engine that gives you confidence levels and bounding boxes would give you much better raw results in the first place, plus extra information that could be used to determine the correct result.
Using Linux will restrict the possible OCR engines available to you. Personally I would rate Tesseract as 6.5/10 compared to commercial OCR engines available under Windows.
http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.
http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux
http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.
All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.
https://launchpad.net/cuneiform-linux - Cuneiform, now open sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.
Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.
Can you post a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition, but it would depend on the images.
Maybe repeat the "longest common substring" until all results are the same.
For your example, you would get the following in the next step:
or m_ip u
or m_ip u
or m_ip u
Or run the "longest common substring" algorithm on the first and second string, and then again on that result and the third string. That way you get the same result, or m_ip u, more easily.
So you can assume that those letters are correct. Now look at the gaps between them. Before or there is l twice and X once, so choose l. Between or and m_ip there is e twice and XY once, so choose e. And so on.
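To make the voting idea concrete, here is a minimal Python sketch. It assumes the three results have already been padded to the same length with a gap placeholder (the alignment itself, anchored on the common pieces, is not computed here), and then takes a column-wise majority vote; the hand-aligned strings below are illustrative.
    from collections import Counter
    GAP = "-"   # placeholder for a character missing in one result
    def vote(aligned):
        # Majority vote per column over equally long, pre-aligned strings.
        out = []
        for column in zip(*aligned):
            winner, _ = Counter(column).most_common(1)[0]
            if winner != GAP:
                out.append(winner)
        return "".join(out)
    aligned = [
        "Xore-m_ipsum",
        "lorXYm_ipsum",
        "lore-m_ip-uX",
    ]
    print(vote(aligned))   # -> lorem_ipsum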
I'm new to OCR, but so far I have found that these systems are built to work from a dictionary of words rather than letter by letter. So, if your images don't contain real words, you may have to look more closely at the letter recognition and training parts of the systems you are using.
I faced a very similar problem.
I hope that this can help: http://dl.tufts.edu/catalog/tufts:PB.001.011.00001
See also software developed by Bruce Robertson: https://github.com/brobertson/rigaudon

Algorithm that spans two pages in LaTeX

I have a long algorithm that I need to put in a report, and I am using LaTeX for this report. Because of its length, the algorithm runs to more than one page, but I cannot get it to continue onto the next page. I am new to LaTeX; can someone tell me how to do this?
You should manually split the algorithm into two parts. You can just chop it in half, as redtuna suggested, or even better, you can factor out an interesting chunk into a new function and put that on a separate page. This will likely make the algorithm more readable too.
Split it into two. If you're using one of the packages that lets you number lines, tell the second half to start with a line number that's one plus the last line number of the first half.
You'll probably be able to get better quality answers if you tell us which package you're using to format your algorithm (or ask for suggestions; I've had good results with "listings").

How to find unique entries in a large data set?

I have 100 million lines of data; each line is a single word no longer than 15 chars, one word per line. The data is stored in multiple files.
My goal is to find the unique words among all the files.
One solution is to import all the words into a database and add a unique key on the field, but this is too slow for such a large data set.
Is there any faster solution?
Thank you
I'm not sure that there'll be many faster ways than using a database. Personally, I usually use a UNIX shell script for this:
cat * | sort | uniq
I don't know how fast that would be with 100,000,000 words, and I'm not sure how fast you want it to be. (E.g., do you need to run it lots of times or just once? If just once, I'd go with the sort and uniq option and let it run overnight if you can).
Alternatively, you could write a script in ruby or a similar language that stored the words in an associative array. I suspect that would almost certainly be slower than the database approach though.
I guess if you really want speed, and you need to carry out this task (or ones like it) often, then you might want to write something in C, but to me that feels a bit like overkill.
Ben
Using a database for this is insane. 100 million records of 15 chars fit in RAM. If there is at least some duplication, simply build a trie. You should be able to process 50 MB/second or so on a modern machine.
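As an illustration of the trie idea, here is a rough Python sketch (nested dicts; in pure Python a plain set is usually leaner and faster, and a 50 MB/s implementation would more likely be written in C):
    import sys
    END = object()   # leaf marker so that "car" and "carpet" can coexist
    def add(trie, word):
        # Insert a word; return True if it was not present before.
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        if END in node:
            return False
        node[END] = True
        return True
    trie, unique = {}, 0
    for name in sys.argv[1:]:   # e.g. python count_unique.py data*.txt
        with open(name, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if add(trie, line.strip()):
                    unique += 1
    print(unique, "unique words")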
If you have to stick with the file structure, then you need some way of indexing the files and then maintaining the index.
Otherwise, I would recommend moving to a database and migrating all operations on that file to work with the database.
You could store the words in a hashtable. Assuming there are quite a number of duplicates, the O(1) search time will be a big performance boost.
Read a line.
Search for the word in the hashtable.
If not found, add it to the table.
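A minimal sketch of this approach in Python, where a built-in set serves as the hash table (note that 100 million short strings can still need several gigabytes of RAM):
    import sys
    # A set is a hash table: membership tests and inserts are O(1) on average.
    seen = set()
    for name in sys.argv[1:]:
        with open(name, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                seen.add(line.rstrip("\n"))
    print(len(seen), "unique words")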
If you have this much data, then it needs to be in a SQL server. This is why SQL was designed in the first place. If you continue to use these files you will forever be stuck with performance issues.
Even if these files are modified from external programs (or via FTP) you need to create an import process to run nightly.
You can conserve speed, space, or your sanity. Pick any two.
Throwing it all into a database sacrificed both speed and space, as you found out. But it was easy.
If space is your main problem (memory, disk space) then partition the work. Filter all of the 1 character lines from the files and use one of the above solutions (sort, uniq). Repeat with the 2 character lines for each file. And so on. The unique solutions from each pass form your solution set.
If your main problem is speed, then read each file exactly once creating a hash table (dictionary, whatever) to look for duplicates. Depending on the hash implementation, this could eat up bucketloads of memory (or disk). But it'll be fast.
If you need to conserve speed and space, then consider blending the two techniques. But be prepared to sacrifice the third item.
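As a sketch of the partitioning idea (assuming, as stated, that words are at most 15 characters): one pass per word length, so only that length's words are held in memory at a time, at the cost of re-reading every file once per pass.
    import sys
    # One pass per word length keeps memory bounded by the largest bucket.
    total_unique = 0
    for length in range(1, 16):
        seen = set()
        for name in sys.argv[1:]:
            with open(name, encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    word = line.rstrip("\n")
                    if len(word) == length:
                        seen.add(word)
        total_unique += len(seen)
    print(total_unique, "unique words")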
If there's significant duplication within individual files, it may be quicker to do it file by file then merge the results. Something along the lines of:
{ for n in * ; do sort -u $n ; done } | sort -u
(I'm assuming GNU bash and GNU sort)
I think the choice of best solution will depend heavily on the distribution of duplicates and the number of separate files, though, which you haven't shared with us.
Given myhusky's clarification (plenty of dupes, 10~20 files), I'll definitely suggest this as a good solution. In particular, dense duplication will speed up sort -u versus sort | uniq.
