Find plagiarism in bulk articles [closed] - string-comparison

I have a collection of 20,000 master articles, and I will receive about 400,000 articles of one or two pages every day. I am trying to determine whether each of these 400k articles is a copy or a modified version of one of my master articles (a threshold of above 60% plagiarism is fine with me).
What algorithms and technologies should I use to tackle this problem efficiently and in a timely manner?
Thanks

Fingerprint the articles (i.e. hash them intelligently based on word frequency) and then look for statistical connections between the fingerprints. If the fingerprints suggest a likely match for part of the data set, do a brute-force search for matching strings on those candidates.
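A minimal sketch of one way to do that fingerprinting, using word shingles and MinHash (the shingle size, number of hash functions, and the md5-based hashing are illustrative assumptions, not the only choice):

```python
import hashlib

NUM_HASHES = 128    # fingerprint length (illustrative)
SHINGLE_SIZE = 5    # word n-gram length (illustrative)

def shingles(text, k=SHINGLE_SIZE):
    """Break a document into overlapping word k-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash(shingle_set, num_hashes=NUM_HASHES):
    """Fingerprint: for each seed, keep the smallest hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_hashes)
    ]

def similarity(fp_a, fp_b):
    """Estimated Jaccard similarity = fraction of matching fingerprint slots."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)

master = minhash(shingles("the quick brown fox jumps over the lazy dog again and again"))
incoming = minhash(shingles("the quick brown fox jumps over the lazy dog again and again today"))

# Pairs above the 60% threshold would then be verified with an exact string comparison.
if similarity(master, incoming) >= 0.6:
    print("likely copy - verify with exact matching")
```

Precompute the 20k master fingerprints once; each incoming article then needs only fingerprint comparisons (optionally bucketed with locality-sensitive hashing) before any expensive exact matching.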

Related

What's a good data structure to store and work with a thesaurus? [closed]

I've been working for a few years on an English-language thesaurus project, which combines a few sources (e.g. WordNet, Wiktionary Thesaurus, Moby Thesaurus, Word2vec) to make a large thesaurus. Currently I have the data defined as a list of lists, and each link has a score (higher = stronger), so "hotel" and "inn" might have a score of 2.0, while "hotel" and "fleabag" might have a score of 0.2. High scores are near-synonyms; low scores are more distant associations. I've been able to use Dijkstra and A* to find links between words (so-called "synonym chains").
Is there a graph database and/or set of analysis tools that is ideally suited to this sort of data? Word relationship strengths are often asymmetric: for example, "Hoover Dam" links to "Herbert Hoover" more strongly than "Herbert Hoover" links back to "Hoover Dam". I'm interested in better ways to find links between words, find unrelated words, and measure word similarity.
I'd appreciate any new pointers/direction.
Interesting question. Not sure about the best data structure, but for processing, you can look at shell neighbors within this package: https://grispy.readthedocs.io/en/latest/api.html
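As an illustration of the synonym-chain part, here is a minimal sketch of storing the asymmetric scores as a weighted directed graph in networkx and running Dijkstra over inverted scores (the word list, scores, and the 1/score conversion are illustrative assumptions):

```python
import networkx as nx

# (word_a, word_b, score): higher score = stronger link; direction matters.
links = [
    ("hotel", "inn", 2.0),
    ("inn", "hotel", 2.0),
    ("hotel", "fleabag", 0.2),
    ("fleabag", "motel", 0.5),
    ("hotel", "lodging", 1.5),
    ("lodging", "motel", 1.2),
]

G = nx.DiGraph()
for a, b, score in links:
    # Dijkstra minimizes total distance, so convert strength into a distance.
    G.add_edge(a, b, weight=1.0 / score)

# A synonym chain is the shortest weighted path between two words.
print(nx.dijkstra_path(G, "hotel", "motel", weight="weight"))
# ['hotel', 'lodging', 'motel']
```

Dedicated graph databases (e.g. Neo4j) model the same thing as weighted, directed relationships and ship with shortest-path and similarity procedures, which may scale better than an in-memory graph.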

Find duplicate images algorithm [closed]

I want to create a program that finds duplicate images in a directory, something like this app does, and I wonder what algorithm I should use to determine whether two images are the same.
Any suggestion is welcome.
Depending on your use case, this task can be solved with perceptual hashing, combined with a data structure for nearest-neighbor search in high dimensions (k-d tree, ball tree, ...) that can (somewhat) replace a brute-force search.
There are tons of approaches for images: DCT-based, Wavelet-based, Statistics-based, Feature-based, CNNs (and more).
Their designs are usually based on different assumptions about the task, e.g. whether rotation is allowed or not.
A Google Scholar search for "perceptual image hashing" will list a lot of papers. You can also look for the term "image fingerprinting".
Here is some older ugly python/cython code doing the statistics-based approach.
Remark: digiKam can do that for you too. It uses an older Haar-wavelet-based approach, I think.
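For a quick start without implementing a hash yourself, here is a minimal sketch using the Pillow and imagehash packages (the JPEG-only glob, the pHash choice, and the distance threshold are illustrative assumptions; it also compares all pairs rather than using a tree-based search):

```python
from pathlib import Path
from PIL import Image
import imagehash  # pip install pillow imagehash

THRESHOLD = 5  # max Hamming distance to treat two images as duplicates (illustrative)

def find_duplicates(directory):
    """Return pairs of images whose DCT-based perceptual hashes are close."""
    seen = {}
    duplicates = []
    for path in sorted(Path(directory).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))
        for other_path, other_hash in seen.items():
            if h - other_hash <= THRESHOLD:  # '-' gives the Hamming distance
                duplicates.append((path, other_path))
        seen[path] = h
    return duplicates

print(find_duplicates("photos"))
```

For large collections, the 64-bit hashes can be indexed (e.g. with a BK-tree, or the nearest-neighbor structures mentioned above) instead of comparing every pair.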

Data structure for dealing with millions of record [closed]

Which data structure is appropriate for operating on millions of records that I later need to iterate over?
While a simple linked list might be sufficient for your needs, if you also need to maintain records in sorted order and efficiently access records or begin iteration at an arbitrary point, I would recommend looking into a B-tree.
If you want to persist the data to disk, use a key-value store; these often use B-trees (or LSM trees) under the hood and also provide ACID guarantees. Examples include LMDB, BerkeleyDB, and LevelDB.
In short, use a database.
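For illustration, a minimal sketch with Python's built-in sqlite3 (the table layout and record count are made up) shows the pattern: a disk-backed B-tree index that keeps records in key order and lets you start iterating at any point:

```python
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value TEXT)")

with conn:  # single transaction for the bulk insert
    conn.executemany(
        "INSERT OR REPLACE INTO records VALUES (?, ?)",
        ((f"key{i:07d}", f"value {i}") for i in range(100_000)),
    )

# Iterate in sorted key order, starting from an arbitrary key.
for key, value in conn.execute(
        "SELECT key, value FROM records WHERE key >= ? ORDER BY key",
        ("key0050000",)):
    pass  # process each record here

conn.close()
```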

Searching through an list [closed]

I'm reading about AI and in the notes it is mentioned
A lookup table in chess would have roughly 35^100 entries.
But what does this mean? Is there any way we could find out how long it would take the computer to search through the table and find its entry? Would we assume there is some order, or that there is no order?
The number of atoms in the known universe is estimated to be around 10^80, which is vastly less than 35^100 ≈ 10^154. With current technology, at least a few thousand atoms are required to store a single bit, and each entry of your table would presumably take multiple bits. You would need some really advanced technology to implement the memory of such a computer.
So the answer is: With current technology it is not a matter of time, it is simply impossible.
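The arithmetic behind that comparison, as a quick check with Python's math module:

```python
import math

# Express 35^100 as a power of ten and compare with ~10^80 atoms in the known universe.
log10_entries = 100 * math.log10(35)
print(f"35^100 ~= 10^{log10_entries:.0f}")                 # 35^100 ~= 10^154
print(f"entries per atom ~= 10^{log10_entries - 80:.0f}")  # ~10^74 entries per atom
```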

How to calculate the relevance of two words or phrases? [closed]

I need an algorithm to calculate and measure the relevance of two words or phrases, e.g. "Apple" and "iPad".
Can anybody give me some hints or related books on such topics?
Thanks.
Have a look at mutual information and tf-idf. These are methods that are frequently used in information retrieval. The former quantifies the mutual dependence of two variables (each variable can be a phrase). The latter was traditionally used by search engines to prioritize results that were relevant to a particular query.
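For the mutual-information route, a minimal sketch of pointwise mutual information over a toy document collection (the corpus and the substring-based co-occurrence test are illustrative assumptions):

```python
import math

# Toy corpus: two terms "co-occur" if they appear in the same document.
docs = [
    "apple announces new ipad",
    "apple releases ipad update",
    "banana bread recipe",
    "apple pie recipe",
    "new ipad review",
]

def pmi(term_a, term_b, documents):
    """Pointwise mutual information of two terms over a document collection."""
    n = len(documents)
    p_a = sum(term_a in d for d in documents) / n
    p_b = sum(term_b in d for d in documents) / n
    p_ab = sum(term_a in d and term_b in d for d in documents) / n
    if p_ab == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2(p_ab / (p_a * p_b))

print(pmi("apple", "ipad", docs))    # positive => the terms attract each other
print(pmi("apple", "recipe", docs))  # near zero or negative => weaker association
```

With real data you would estimate these probabilities from a large corpus (or search-engine hit counts), and often combine them with tf-idf-weighted context vectors rather than raw counts.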
