I have a large database with thousands of records. Every time a user posts his information, I need to know whether there is already the same or a similar record. Are there any algorithms or open-source implementations that solve this problem?
We're using Chinese, and by 'similar' I mean the records have mostly identical content, perhaps 80%-100% the same. Each record is not too big, about 2k-6k bytes.
http://d3s.mff.cuni.cz/~holub/sw/shash/
http://matpalm.com/resemblance/simhash/
This answer has a very high complexity class (worst case it's quintic; expected case it's quartic to verify your database the first time, then quartic/cubic to add a record), so it doesn't scale well. Unfortunately, there isn't a much better answer that I can think of right now.
The algorithm is called the Ratcliff-Obershelp algorithm; it's implemented in Python's difflib. The algorithm itself is cubic time in the worst case and quadratic in the expected case. You then have to run it for each possible pair of records, which is quadratic in the number of records. When adding a single record, of course, the comparison against existing records is only linear.
EDIT: Sorry, I misread the documentation: difflib is only quadratic, not cubic. Use it rather than the other algorithm.
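To make that concrete, here is a minimal sketch using difflib's SequenceMatcher (which is based on the Ratcliff/Obershelp heuristic); the 0.8 threshold and the placeholder records are only assumptions based on the 80% figure in the question:

from difflib import SequenceMatcher

def is_similar(record_a, record_b, threshold=0.8):
    # ratio() returns a similarity score in [0.0, 1.0]; the threshold is an assumption.
    return SequenceMatcher(None, record_a, record_b).ratio() >= threshold

# Checking one new record against the stored ones is linear in the number of records
# (each individual comparison is quadratic in record length).
existing_records = ["first stored record text", "second stored record text"]
new_record = "first stored record text"
duplicates = [r for r in existing_records if is_similar(new_record, r)]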
Look at shingle/min-hash techniques. Here is a presentation that could help you.
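For a rough idea of what shingling plus MinHash looks like (my own sketch, not taken from the linked presentation), the shingle size of 4 characters and the 64 hash functions below are arbitrary assumptions; character shingles avoid having to segment the Chinese text into words:

import hashlib
import random

def shingles(text, k=4):
    # Overlapping character k-grams of the record.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    # For each of num_hashes salted hash functions, keep the minimum hash value.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for salt in salts]

def estimated_similarity(sig_a, sig_b):
    # The fraction of matching signature positions approximates the Jaccard
    # similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)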
One approach I have used to do something similar is to construct a search index in the usual way based on word statistics, and then use the new item as a query against that index; if the score for the top hit is too high, then the new item is too similar. No doubt some of the standard text search libraries could be used for this, although if it is only a few thousand records it is pretty trivial to build your own.
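As a minimal sketch of that idea (my own illustration), treating each stored record's character counts as the "index" and cosine similarity as the score; the 0.8 cutoff is an assumed threshold:

from collections import Counter
import math

def cosine_similarity(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def too_similar(new_record, stored_records, threshold=0.8):
    # A real search library would score against an inverted index instead of
    # scanning every stored record, but the idea is the same.
    return any(cosine_similarity(new_record, old) >= threshold for old in stored_records)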
I'm basically trying to use a Genetic Algorithm or an Iterated Local Search algorithm to get an optimal solution to a problem. Can someone please explain the basic difference between these two algorithms, and are there any situations where one of them is better than the other?
Let me start with the second question. I believe there is no way to determine the better algorithm for a given problem without trials and tests. The behavior of an algorithm depends heavily on the problem's properties. If we are talking about complex problems with hundreds or thousands of variables, it's just too difficult to predict anything. I'm not counting an engineer's intuition, deep problem understanding, previous experience, and so on; they are not really measurable.
The main difference between global and local search is quite straightforward: local search considers just one or a few candidate solutions at a time and tries to improve them with small modifications. Thus, on each iteration it considers only a small portion of the search space (the local neighborhood). Global search tries to take the whole problem, with all its parameters, into account at the same time. For example, PSO samples a huge number of candidates and tries to move all of them toward the global optimum using a simple update formula.
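As a small illustration of the local-search half of that distinction (my own sketch, not part of the answer above), here is a bare-bones hill climber that keeps a single candidate and accepts a small modification only when it improves the objective:

import random

def hill_climb(initial, score, neighbor, iterations=1000):
    # Local search: consider one candidate at a time and try small modifications.
    best, best_score = initial, score(initial)
    for _ in range(iterations):
        candidate = neighbor(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Example: maximize f(x) = -(x - 3)^2 starting from a random point.
result = hill_climb(
    initial=random.uniform(-10, 10),
    score=lambda x: -(x - 3) ** 2,
    neighbor=lambda x: x + random.uniform(-0.5, 0.5))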
I know one way to solve this is to hash the words with their corresponding counts, then traverse the hash map and figure out the top 3.
Is there any better way to solve this? Would it be better if I used a BST instead of a HashMap?
A trie is a good data structure for this. There is no need for hash calculations, and the cost of an insert or update is O(1) with respect to the size of the dictionary (it depends only on the length of the word being inserted).
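A minimal sketch of such a counting trie (my own illustration, assuming the words are already tokenized):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0

def add_word(root, word):
    # Cost depends only on the word's length, not on how many words are already stored.
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def top_words(root, k=3):
    # Collect (count, word) pairs by walking the trie, then keep the k largest.
    results = []
    def walk(node, prefix):
        if node.count:
            results.append((node.count, prefix))
        for ch, child in node.children.items():
            walk(child, prefix + ch)
    walk(root, "")
    return sorted(results, reverse=True)[:k]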
Basically, a histogram is the standard way of doing this. Take your pick of which implementation you want behind the histogram interface; the difference between them is really instance-specific, and each has its advantages and disadvantages.
You might also want to consider a map-reduce design to get the word counts:
map(doc):
    for each word in doc:
        emitIntermediate(word, "1")

reduce(word, list<string>):
    emit(word, size(list))
This approach scales well if you have a lot of documents (using a map-reduce framework), and it is an elegant solution if you like functional programming.
Note: this approach is basically the same as the hash solution, since the (key, value) pairs emitted by the mapper are routed to reducers by hashing the key.
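On a single machine, the same map/reduce shape can be sketched in plain Python (my own illustration, not tied to any particular framework):

from collections import defaultdict

def map_doc(doc):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in doc.split():
        yield word, 1

def reduce_pairs(pairs):
    # Group the intermediate pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return counts

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_pairs(pair for doc in docs for pair in map_doc(doc))
top3 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]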
Either a HashMap or a BST is a reasonable choice. The performance of each will vary depending on the number of words you need to count. A profiler is your friend here (VisualVM is a reasonable choice to start with).
I would wager that a hash table will perform better in this case, since there are likely many distinct words. Lookups take O(1) rather than O(log N).
I'm doing some web-crawling work where I look for certain terms in web pages, find their locations on the page, and cache them for later use. I'd like to be able to check each page periodically for any major changes. Something like MD5 can be foiled simply by putting the current date and time on the page.
Are there any hashing algorithms that work for something like this?
A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content-defined chunking for a way to split up the document.
I read a paper a few years back about using Bloom filters for similarity detection. Using Bloom Filters to Refine Web Search Results. It's an interesting idea, but I never got around to experimenting with it.
This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.
The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.
You also might try some sort of hybrid approach--let a hashing algorithm tell you that any change has been made, and use it as a trigger to retrieve an archival copy of the document for more rigorous (Levenshtein) comparison.
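For reference, a standard dynamic-programming sketch of Levenshtein distance (my own illustration; in practice you would likely use an existing library):

def levenshtein(a, b):
    # previous[j] holds the edit distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# e.g. levenshtein("kitten", "sitting") == 3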
http://www.phash.org/ did something like this for images. The gist: take an image, blur it, convert it to greyscale, do a discrete cosine transform, and look at just the upper-left quadrant of the result (where the important information is). Then record a 0 for each value less than the average and a 1 for each value greater than the average. The result is fairly robust to small changes.
Min-Hashing is another possibility. Find features in your text and record them as a value. Concatenate all those values to make a hash string.
For both of the above, use a vantage point tree so that you can search for near-hits.
I am sorry to say it, but hash algorithms are exact by design; none of them can tolerate minor differences. You should take another approach.
I have a project benchmarking string matching algorithms, and I would like to know whether there is a standard way to test each algorithm so that I can get fair results from my experiments. I am planning to use Java's System.nanoTime to measure the running time of each algorithm. Any comments or reactions regarding my problem are very much appreciated. Thanks!
I am not entirely sure what you're asking. However, I am guessing you want to know how to get the most realistic results. You need to run your algorithm hundreds or even thousands of times to get an average. It is also very important to turn off any caching your language may do, and not to reuse objects, unless that is part of your algorithm.
I am not entirely sure what you're asking. However, another interpretation of what you are asking can be answered by trying to work out how a given algorithm performs as you increase the size of the problem. Using raw time to compare algorithms at a given string size does not necessarily allow for accurate comparison. Instead, you could try each algorithm with different string sizes and see how the algorithm behaves as string size varies.
And Mark's advice is good too. So you are running repeated trials for many different string lengths to get a picture of how one algorithm works, then repeating that for the next algorithm.
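Putting those two pieces of advice together, a timing harness looks roughly like this (sketched in Python for brevity; with System.nanoTime the structure would be the same, and the sizes, trial count, and text.find stand-in are all assumptions):

import random
import string
import time

def benchmark(search_fn, sizes=(1_000, 10_000, 100_000), trials=100):
    # Repeat each measurement many times and vary the input size
    # to see how the algorithm scales.
    for n in sizes:
        text = "".join(random.choices(string.ascii_lowercase, k=n))
        pattern = text[n // 2 : n // 2 + 8]
        start = time.perf_counter()
        for _ in range(trials):
            search_fn(text, pattern)
        elapsed = (time.perf_counter() - start) / trials
        print(f"n={n}: {elapsed * 1e6:.1f} microseconds per call")

benchmark(lambda text, pattern: text.find(pattern))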
Again, it's not clear what you're asking, but here's another thought in addition to what Tony and Mark said:
Be very careful about testing only "real" input or only "random" input. Some algorithms are tuned to do well on typical input (searching for a word in English text), while others are tuned for working well on pathologically hard cases. You'll need a huge mix of possible inputs of all different types and sizes to do a truly good benchmark.
Sorting has been studied for decades, so surely the sorting algorithms provided by any programming platform (Java, .NET, etc.) must be good by now, right? Is there any reason to override something like System.Collections.SortedList?
There are absolutely times when your intimate understanding of your data can yield a much, much more efficient sorting algorithm than any general-purpose algorithm available. I shared an example of such a situation in another post on SO, but I'll share it here as a case in point:
Back in the days of COBOL, FORTRAN, etc... a developer working for a phone company had to take a relatively large chunk of data that consisted of active phone numbers (I believe it was in the New York City area), and sort that list. The original implementation used a heap sort (these were 7 digit phone numbers, and a lot of disk swapping was taking place during the sort, so heap sort made sense).
Eventually, the developer stumbled on a different approach: realizing that each phone number could appear at most once in his data set, he saw that he didn't have to store the actual phone numbers in memory at all. Instead, he treated the entire 7-digit phone number space as one very long bit array (at 8 phone numbers per byte, 10 million phone numbers requires just over a meg to cover the entire space). He then made a single pass through his source data and set the bit for each phone number he found to 1. Finally, he made a pass through the bit array looking for set bits and output the sorted list of phone numbers.
This new algorithm was much, much faster (at least 1000x faster) than the heap sort algorithm, and consumed about the same amount of memory.
I would say that, in this case, it absolutely made sense for the developer to develop his own sorting algorithm.
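A miniature sketch of that bit-vector idea (my own illustration; the original was of course not written in Python, and the 10,000,000-number space assumes 7-digit numbers):

from array import array

def bitvector_sort(phone_numbers, space=10_000_000):
    # One bit per possible 7-digit number; assumes each number appears at most once.
    bits = array("B", bytes(space // 8 + 1))
    for n in phone_numbers:
        bits[n // 8] |= 1 << (n % 8)
    # A single pass over the bit array yields the numbers in sorted order.
    return [i for i in range(space) if bits[i // 8] & (1 << (i % 8))]

print(bitvector_sort([5551234, 1234567, 9999999]))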
If your application is all about sorting, and you really know your problem space, then it's quite possible for you to come up with an application specific algorithm that beats any general purpose algorithm.
However, if sorting is an ancillary part of your application, or you are just implementing a general purpose algorithm, chances are very, very good that some extremely smart university types have already provided an algorithm that is better than anything you will be able to come up with. Quick Sort is really hard to beat if you can hold things in memory, and heap sort is quite effective for massive data set ordering (although I personally prefer to use B+Tree type implementations for the heap b/c they are tuned to disk paging performance).
Generally no.
However, you know your data better than the people who wrote those sorting algorithms. Perhaps you could come up with an algorithm that is better than a generic algorithm for your specific set of data.
Implementing your own sorting algorithm is akin to optimization, and as Sir Charles Antony Richard Hoare said, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil".
Certain libraries (such as Java's very own Collections.sort) implement a sort based on criteria that may or may not apply to you. For example, Collections.sort uses a merge sort for its O(n log n) efficiency and because it is a stable sort: if two elements compare as equal, the one that came first in the original collection stays in front. That is useful for multi-pass sorting by different criteria (first sort by date, then by name, and the collection ends up sorted by name, then date). However, if you want slightly better constants or have a special data set, it might make more sense to implement your own quick sort or radix sort tailored exactly to what you want to do.
That said, all operations are fast for sufficiently small n.
Short answer: no, except for academic interest.
You might want to multi-thread the sorting implementation.
You might need better performance characteristics than Quicksort's O(n log n); think bucket sort, for example.
You might need a stable sort while the default algorithm uses quicksort. Especially for user interfaces you'll want to have the sorting order be consistent.
More efficient algorithms might be available for the data structures you're using.
You might need an iterative implementation of the default sorting algorithm because of stack overflows (e.g. you're sorting large data sets).
Ad infinitum.
A few months ago the Coding Horror blog reported on some platform with an atrociously bad sorting algorithm. If you have to use that platform then you sure do want to implement your own instead.
The problem of general purpose sorting has been researched to hell and back, so worrying about that outside of academic interest is pointless. However, most sorting isn't done on generalized input, and often you can use properties of the data to increase the speed of your sorting.
A common example is the counting sort. It is proven that for general purpose comparison sorting, O(n lg n) is the best that we can ever hope to do.
However, suppose we know that the values to be sorted lie in a fixed range, say [a, b]. If we create an array of size b - a + 1 (with everything initialized to zero), we can scan the input linearly, using this array to store the count of each element. That yields a sort that is linear in the size of the input plus the range of the data, beating the n lg n bound, but only because we are exploiting a special property of our data. For more detail, see here.
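A minimal counting-sort sketch along those lines (my own illustration):

def counting_sort(values, lo, hi):
    # counts[v - lo] records how many times v occurs; the range [lo, hi] must be known.
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    result = []
    for offset, count in enumerate(counts):
        result.extend([lo + offset] * count)
    return result

# e.g. counting_sort([5, 3, 9, 3, 7], 0, 9) == [3, 3, 5, 7, 9]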
So yes, it is useful to write your own sorting algorithms. Pay attention to what you are sorting, and you will sometimes be able to come up with remarkable improvements.
If you have experience at implementing sorting algorithms and understand the way the data characteristics influence their performance, then you would already know the answer to your question. In other words, you would already know things like a QuickSort has pedestrian performance against an almost sorted list. :-) And that if you have your data in certain structures, some sorts of sorting are (almost) free. Etc.
Otherwise, no.