Naive approaches to detecting plagiarism? [closed]

Let's say you wanted to compare essays by students and see if one of those essays was plagiarized. How would you go about this in a naive manner (i.e. without an overly complicated approach)? Of course, there are simple ways like comparing the words used in the essays, and complicated ways like using compression functions, but what are some other ways to check for plagiarism without too much complexity/theory?

Several papers describe different approaches. I recommend reading this one.
The paper presents an algorithm based on an index structure built over the entire file collection. The authors say their algorithm can also be used to find similar code fragments in a large software system. Before the index is built, all the files in the collection are tokenized. This is a simple parsing problem and can be solved in linear time. For each of the N files in the collection, the output of the tokenizer for a file F_i is a string of n_i tokens.
Here is another paper you could read.
Another good algorithm is a SCAM-based one, which detects plagiarism by comparing the set of words that the test document and the registered document have in common. The authors evaluate their plagiarism detection system, like many Information Retrieval systems, with metrics of precision and recall.
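To make the "compare the set of common words" idea concrete, here is a minimal sketch in Python. It is not the algorithm from either paper; it swaps in plain Jaccard similarity over word sets as the simplest stand-in, and the function names and sample sentences are made up for illustration.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard_similarity(doc_a, doc_b):
    """Shared distinct words divided by all distinct words (0.0 to 1.0)."""
    a, b = word_set(doc_a), word_set(doc_b)
    if not a and not b:
        return 0.0  # two empty documents: treat as "nothing in common"
    return len(a & b) / len(a | b)

# Hypothetical sample texts, not taken from any real corpus.
registered = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over a lazy dog"
unrelated = "support vector machines maximize the margin"

print(jaccard_similarity(registered, suspect))    # 7 shared / 10 total = 0.7
print(jaccard_similarity(registered, unrelated))  # only "the" is shared
```

A score close to 1.0 only says the two essays use nearly the same vocabulary, so in practice you would pick a threshold by hand and still read the flagged pairs yourself. Comparing sets of consecutive word pairs or triples instead of single words is a common way to make the check less sensitive to topic overlap.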

You could take a look at Dick Grune's similarity comparator, which claims to work on natural language texts as well (I've only tried it on software). The algorithms are described as well. (By the way, his book on parsing is really good, in my opinion.)


How does SVM work? [closed]

Is it possible to provide a high-level, but specific explanation of how SVM algorithms work?
By high-level I mean it does not need to dig into the specifics of all the different types of SVM, parameters, none of that. By specific I mean an answer that explains the algebra, versus solely a geometric interpretation.
I understand it will find a decision boundary that separates the data points from your training set into two pre-labeled categories. I also understand it will seek to do so by finding the widest possible gap between the categories and drawing the separation boundary through it. What I would like to know is how it makes that determination. I am not looking for code, rather an explanation of the calculations performed and the logic.
I know it has something to do with orthogonality, but the specific steps are very "fuzzy" everywhere I could find an explanation.
Here's a video that covers one seminal algorithm quite nicely. The big revelations for me are (1) optimizing the square of the critical metric, which gives us a value that's always positive, so that minimizing the square (still easily differentiable) gives us the optimum; and (2) using a simple, but not-quite-obvious, "kernel trick" to make the vector classifications easy to compute.
Watch carefully how the unwanted terms disappear, leaving N+1 vectors to define the gap space in N dimensions.
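Since the question asks for the algebra, here is the standard hard-margin formulation written out. I am assuming this is what the video's "square of the critical metric" refers to; the notation (training points x_i, labels y_i in {-1, +1}, weight vector w, offset b, multipliers alpha_i, m training points) is mine, not the video's.

```latex
% Primal problem: the gap between the classes has width 2 / ||w||,
% so maximizing the gap is the same as minimizing ||w||^2.
\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b)\ \ge\ 1,\qquad i=1,\dots,m
\]

% Dual problem (via Lagrange multipliers): the data appear only
% through the dot products x_i . x_j.
\[
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i
 - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
   \alpha_i\,\alpha_j\,y_i\,y_j\,(x_i\cdot x_j)
\quad\text{subject to}\quad
\alpha_i \ge 0,\qquad \sum_{i=1}^{m}\alpha_i\,y_i = 0
\]
```

In the dual form most of the alpha_i come out as zero; the points with nonzero alpha_i are the support vectors that define the gap. The kernel trick amounts to replacing each dot product x_i . x_j with a kernel function K(x_i, x_j).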
I'll give you a few small details that will help you continue understanding how SVM works.
Keep everything simple: two dimensions and linearly separable data. The general idea in SVM is to find a hyperplane that maximizes the margin between the two classes. Each of your data points is a vector from the origin. Once you propose a hyperplane, you project your data vector onto the vector defining the hyperplane and then check whether the length of the projected vector falls before or after the hyperplane; this is how you define your two classes.
This is a very simple way of seeing it, and you can then go into more detail by following some papers or videos.
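The projection step can be sketched in a few lines of Python. This only shows how a given hyperplane classifies points; it does not show how SVM finds the best one. The hyperplane (w, b) and the sample points below are hypothetical values chosen for illustration, and I am taking "the vector defining the hyperplane" to be its normal vector w.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))

def classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical hyperplane in 2D: the line x1 + x2 = 3.
w = (1.0, 1.0)
b = -3.0

points = [(0.5, 1.0), (1.0, 1.0), (3.0, 2.0), (2.5, 3.0)]
labels = [classify(p, w, b) for p in points]
print(labels)  # [-1, -1, 1, 1]

# The margin of this hyperplane is the smallest absolute distance to any point.
margin = min(abs(signed_distance(p, w, b)) for p in points)
print(margin)  # distance of (1.0, 1.0) to the line, i.e. 1 / sqrt(2)
```

Training an SVM then means searching over w and b for the hyperplane whose margin is as large as possible while all points stay on their correct side.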

What's a good selective pressure to use in tournament selection in a genetic algorithm? [closed]

What is the optimal and usual value of selective pressure in tournament selection? What percent of the best members of the current generation should propagate to the next generation?
Unfortunately, there isn't a great answer to this question. The optimal parameters will vary from problem to problem, and people use a wide range of them. Selecting the right tournament selection parameters is currently more of an art than a science. Stronger selective pressure (a larger tournament) will generally result in the population converging on a solution faster, at the cost of that solution potentially not being as good. This is called the exploration vs. exploitation tradeoff, and it underlies most algorithms for searching a large space of possible solutions - you're not going to get away from it.
I know that's not very helpful, though - you want a starting place, and that's completely reasonable. So here's the best one I know of (and I know a number of others who use it as a go-to default tournament configuration as well): a tournament size of two. Basically, this means you just keep picking random pairs of solutions, choosing the best one, and sending it to the next generation (with mutation and crossover as desired), until the next generation is the desired size. This has the nice property that any member of the population besides the absolute worst has a chance of getting to the next generation, but better ones have a better chance.
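As a rough sketch of that default, assuming a population stored in a list and a fitness function where higher is better (the names tournament_select and next_generation are made up here, and mutation and crossover are left out):

```python
import random

def tournament_select(population, fitness, rng, size=2):
    """Pick `size` distinct random individuals and return the fittest one."""
    contestants = rng.sample(population, size)
    return max(contestants, key=fitness)

def next_generation(population, fitness, rng, size=2):
    """Run one tournament per slot until the new generation is full."""
    return [tournament_select(population, fitness, rng, size)
            for _ in range(len(population))]

# Toy example: individuals are integers and fitness is the value itself.
rng = random.Random(0)
population = list(range(10))
new_population = next_generation(population, lambda x: x, rng)
print(new_population)
```

Because each pair consists of two distinct individuals, the single worst member (0 in this toy population) can never win a tournament, while every other member can. Raising `size` strengthens the selective pressure.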

Algorithms under Plagiarism detection machines [closed]

I'm very impressed by how plagiarism checkers (such as the Turnitin website) work. But how do they do it so effectively? I'm new to this area, so is there a word-matching algorithm, or anything similar, that is used for detecting similar sentences?
Thank you very much.
I'm sure many real-world plagiarism detection systems use more sophisticated schemes, but the general class of problem of detecting how far apart two things are is called the edit distance. That link includes links to many common algorithms used for this purpose. The gist is effectively answering the question "How many edits must I perform to turn one input into the other?". The challenge for real-world systems is performing this across a large corpus in an efficient manner. A related problem is the longest common subsequence, which might also be useful for such schemes to identify passages that are copied verbatim.
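For reference, here is a minimal sketch of one common edit distance, the Levenshtein distance, computed with the usual dynamic-programming table (kept to two rows to save memory). The function name is my own choice, and this says nothing about what Turnitin actually runs.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty prefix takes i deletions
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3

# The same function works on lists of words instead of characters.
print(edit_distance("the cat sat".split(), "the dog sat".split()))  # 1
```

This takes time proportional to the product of the two lengths, which is fine for a pair of sentences but is exactly the part that gets hard to scale across a large corpus.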

Is there any online judge for data mining [closed]

There are many Online Judges (OJ) for ACM/ICPC questions, and another Online Judge for interview questions, named LeetCode.
I think these OJs are very useful for learning algorithms. I am about to start learning data mining algorithms. Is there any OJ for data mining questions?
Thank you very much.
There is MLcomp, where you can submit an algorithm and it will run it on a number of data sets to judge how well it is doing.
Plus, there is Kaggle, which hosts various classification competitions.
And of course you can take classes at Coursera. These are pretty much low-level, but in order to get submission points you need to reproduce the known performance.
In particular, the first also lets you run several standard algorithms such as naive Bayes and SVM and see how well they did. Obviously, your own implementation should then perform similarly.
Unfortunately, both are pretty much focused on machine learning (i.e. classification and regression). There is very little in the unsupervised domain, clustering and outlier detection. On unlabeled data, things get too hard even to evaluate locally, so doing any kind of online judging is pretty much unsolved. What you can do is largely a one-class classification, or you just strip labels before running the algorithm.

Minimum Knowledge required in Data Structures [closed]

I have found numerous data structures on Wikipedia, and have also looked into several books on data structures and found that they vary. I want to know what the basic or minimum list of data structures is that a new CS graduate should know.
Also, is it necessary to know their implementation in more than one programming language, considering that the implementations differ? If I know the implementation of a linked list in C, should I also know its Java-based implementation?
It would be great if you could help me understand categorically:
Basic data structures (necessary for a CS grad)
Advanced data structures
Edit: I am more interested in the list of data structures.
Have a look at Introduction to Algorithms by Cormen et al. In my experience, if you know what is in there you are set for anything coming at you.
I would not consider knowing any implementation very useful. If you know the basics you should be able to implement your own version quickly, but chances are you will never have to because there are libraries for that. So the rule for practice is: know your libraries!
Even so, it is important that you know properties of data structures (e.g. space overhead, runtimes of central operations, behaviour under concurrent accesses, (im)mutability, ...) so you will always use the one best suited to your task at hand.
This question is really a little too broad, even the way you've narrowed it down, because it depends on what sort of future path you're looking at. Grad school? PhD track? Industry? Which industry?
But as a rough minimum, I'd say, take a look at CLRS (as Raphael suggests) and pick out the following:
Linked lists, and the variations like stacks, queues, etc.
Basic heaps
Basic hash tables
Trees, especially including binary search trees, and preferably familiarity with at least one self-balancing BST
Graphs, both matrix-representation and adjacency list representation
And probably some more based on what sort of job you're looking for. As someone on a PhD track... well. All of them. At some point you will take a qualifier and be expected to know most of them.
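To illustrate the last item in that list, here is a small sketch of the same undirected graph stored both ways. The example graph and the function names are made up for illustration.

```python
def to_adjacency_matrix(n, edges):
    """n x n matrix where entry [u][v] is 1 if u and v share an edge."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # undirected: mirror the entry
    return matrix

def to_adjacency_list(n, edges):
    """Dictionary mapping each vertex to the list of its neighbours."""
    adjacency = {v: [] for v in range(n)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    return adjacency

# Hypothetical graph with 4 vertices and 4 edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
matrix = to_adjacency_matrix(4, edges)
adjacency = to_adjacency_list(4, edges)

print(matrix[2][3])   # 1: constant-time edge lookup
print(adjacency[2])   # [0, 1, 3]: direct access to the neighbours
```

The matrix always uses n * n cells but answers "is there an edge?" in constant time; the list uses space proportional to the number of vertices plus edges, which is usually the better fit for sparse graphs.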
Check out MIT's OCW Introduction to Algorithms course. It is a great tutorial on the theory.
For practicing data structures in Java, check Data Structures & Algorithms in Java by Robert Lafore; it is excellent.
Implementation in one language is sufficient, but try to solve problems in both a structured language like C and an OO language like Java or C++. This will help a lot while preparing for interviews.
One good resource for basic data structures in C: here
