Split text files into two groups - unsupervised learning [closed] - text-classification

Imagine you are a librarian, and over time you have classified a bunch of text files (approximately 100) under a single, ambiguous keyword.
Every text file actually belongs to a topic of keyword_meaning1 or a topic of keyword_meaning2.
Which unsupervised learning approach would you use to split the text files into two groups?
What precision (in percent) of correct classification can be achieved for a given number of text files?
Or can one group somehow indicate that certain files need to be checked by a librarian, because they may be classified incorrectly?

The easiest starting point would be a naive Bayes classifier. It's hard to speculate about the expected precision; you have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is quite a good starting point and is easily hackable. A nice feature of SpamBayes is that it labels messages as "unsure" when there is no clear separation between the two classes.
Edit: If you really want an unsupervised clustering method, then something like Carrot2 (http://project.carrot2.org/) may be more appropriate.
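Since the question asks for an unsupervised split into two groups, here is a minimal sketch of that route using scikit-learn: TF-IDF vectors, k-means with two clusters, and a distance margin used to flag documents a librarian should re-check. The directory name and the "10 files to review" cut-off are assumptions made for illustration, not part of the original answer.

    # Sketch only: cluster ~100 text files into two groups with TF-IDF + k-means
    # and flag the most ambiguous assignments for manual review.
    # The directory name and the "10 files to review" cut-off are assumptions.
    import glob
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    paths = sorted(glob.glob("library/*.txt"))       # hypothetical directory
    texts = [open(p, encoding="utf-8").read() for p in paths]

    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(X)

    # Distance of every document to both centroids; a small gap between the
    # two distances means the assignment is ambiguous.
    dists = km.transform(X)
    margin = np.abs(dists[:, 0] - dists[:, 1])
    for idx in np.argsort(margin)[:10]:              # 10 most ambiguous files
        print("please check:", paths[idx], "-> group", labels[idx])

Precision will depend entirely on how separable the two meanings are in vocabulary, so inspecting the most ambiguous files is usually more useful than a blanket percentage.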

Related

Machine learning and actual predictions [closed]

I have a question about machine learning regarding predictions.
Typically I would have a dataset with x's and y's that I would train my algorithm on. But what if I just have a dataset with input variables only (x's) and no actual predictions (y's)?
For example, I'm looking for fraudulent transactions.
In dataset A I have a bunch of input variables like amounts, zip codes, merchant, etc., and I have a fraud-status variable that is 1 for possible fraud and 0 for a safe transaction. Here I have known frauds and known non-frauds that I can train my model on.
However, what if I have a dataset where there is no fraud variable? All I have are my input variables and no variable that indicates whether a transaction is fraud or not. How could an ML algorithm then predict the probability of a transaction being fraudulent for this specific dataset?
I think what you are looking for is anomaly detection. In anomaly detection you try to find the data points that are different from the rest of the data; in your case, fraudulent transactions.
There are quite a few algorithms available in sklearn; look here. I would recommend starting with the IsolationForest model for your problem.
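To make the IsolationForest suggestion concrete, here is a minimal sketch with scikit-learn. The file name, feature columns, and contamination rate are assumptions for illustration only; they are not taken from the question's dataset.

    # Sketch: unsupervised fraud scoring with IsolationForest.
    # File name, columns and contamination rate are illustrative assumptions.
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.read_csv("transactions.csv")             # hypothetical dataset
    X = df[["amount", "zipcode", "merchant_id"]]     # numeric features only

    model = IsolationForest(contamination=0.01, random_state=0)
    model.fit(X)

    # score_samples: the lower the score, the more anomalous the transaction.
    df["anomaly_score"] = model.score_samples(X)
    df["possible_fraud"] = model.predict(X) == -1    # -1 marks anomalies
    print(df.sort_values("anomaly_score").head(10))

The model never sees a fraud label; it simply ranks transactions by how isolated they are from the bulk of the data.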

Find duplicate images algorithm [closed]

I want to create a program that finds duplicate images in a directory, something like this app does, and I wonder what the algorithm to determine whether two images are the same would be.
Any suggestion is welcome.
This task can be solved with perceptual hashing, depending on your use case, combined with a data structure for nearest-neighbor search in high dimensions (k-d tree, ball tree, ...) that can (somewhat) replace the brute-force search.
There are tons of approaches for images: DCT-based, wavelet-based, statistics-based, feature-based, CNNs, and more.
Their designs are usually based on different assumptions about the task, e.g. is rotation allowed or not?
A Google Scholar search on perceptual image hashing will list a lot of papers. You can also look for the term image fingerprinting.
Here is some older, ugly Python/Cython code doing the statistics-based approach.
Remark: digiKam can do that for you too. It uses an older Haar-wavelet-based approach, I think.
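As a rough sketch of the perceptual-hashing idea (a plain average hash, not the statistics-based code linked above), assuming Pillow is installed; the images/ directory and the Hamming-distance threshold of 5 are also just assumptions:

    # Sketch: average-hash (aHash) duplicate detection with brute-force comparison.
    # Directory name and the Hamming-distance threshold are illustrative assumptions.
    import glob
    from itertools import combinations
    from PIL import Image

    def average_hash(path, size=8):
        """Shrink to size x size grayscale, threshold at the mean -> 64-bit hash."""
        img = Image.open(path).convert("L").resize((size, size))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = ["1" if p > mean else "0" for p in pixels]
        return int("".join(bits), 2)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    paths = glob.glob("images/*.jpg")                 # hypothetical directory
    hashes = {p: average_hash(p) for p in paths}

    # Near-identical images have a small Hamming distance between their hashes.
    for p1, p2 in combinations(paths, 2):
        if hamming(hashes[p1], hashes[p2]) <= 5:
            print("possible duplicate:", p1, p2)

For large collections the O(n^2) comparison above becomes slow, which is where the nearest-neighbor structures (k-d tree, ball tree) mentioned above come in.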

Writing an algorithm to complete a checklist, where to start? [closed]

I'm completing a dissertation in health economics and would like to explore the possibility of using an algorithm to answer a checklist that I manually filled in during my research.
It is a 24-item checklist that asks questions such as "Was a discount rate reported?". Now, the articles I've been reviewing tend to be very codified; that is, there are only a few ways in which they report an answer (e.g. "we discounted at 3% in this evaluation").
Theoretically, I think it would be possible to write a program that could search text and fill out the majority of these checklist items. However, I have very little experience in programming. As far as I can see, a program like this would involve writing an algorithm of sorts, but that is where my knowledge ends.
Particularly, I would like to know
- Is this possible?
- If so, how would I go about exploring this further? Ideally, I'd like to get to a point where I could play around with writing an algorithm to look through my database.
This could definitely be done with simple logic and parsing, but the key is that the manual entries are consistent in the way they are "codified".
For example, you would parse each line for a specific token (or validation word).
In your case you could scan the string "we discounted at 3% in this evaluation" word by word.
Code-wise, that is a basic comparison against each parsed word, roughly:
    if current_word == "discounted":
        checklist["discount_rate_reported"] = True   # create a checkmark
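Expanding that idea into a small, hedged sketch in Python: scan an article's text for simple keyword/regex patterns and fill in a checklist dictionary. The item names and patterns below are invented for illustration; they are not the real 24-item checklist.

    # Sketch: fill a checklist by scanning article text for simple patterns.
    # The item names and regexes are illustrative assumptions, not the real
    # 24-item health-economics checklist.
    import re

    PATTERNS = {
        "discount_rate_reported": re.compile(r"discount(ed|ing)?\s+(rate|at)", re.I),
        "perspective_stated":     re.compile(r"(societal|payer|provider)\s+perspective", re.I),
    }

    def fill_checklist(text):
        """Return {item: True/False} depending on whether the pattern appears."""
        return {item: bool(rx.search(text)) for item, rx in PATTERNS.items()}

    sample = "We discounted at 3% in this evaluation, taking a societal perspective."
    print(fill_checklist(sample))
    # {'discount_rate_reported': True, 'perspective_stated': True}

Anything the patterns miss can simply be left for manual review, which keeps the program useful even if it only covers the most codified items.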

Algorithms under Plagiarism detection machines [closed]

I'm very impressed by how plagiarism checkers (such as the Turnitin website) work, and how effective they are. How do they do it? I'm new to this area, so is there a word-matching algorithm, or anything similar, that is used for detecting alike sentences?
Thank you very much.
I'm sure many real-world plagiarism detection systems use more sophisticated schemes, but the general class of problem of detecting how far apart two things are is called the edit distance. That link includes links to many common algorithms used for this purpose. The gist is effectively answering the question "How many edits must I perform to turn one input into the other?". The challenge for real-world systems is performing this across a large corpus in an efficient manner. A related problem is the longest common subsequence, which might also be useful for such schemes to identify passages that are copied verbatim.
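For a concrete picture of edit distance (a textbook Levenshtein implementation, not what Turnitin actually runs), here is a short dynamic-programming sketch:

    # Sketch: Levenshtein edit distance via dynamic programming.
    # dp[i][j] = edits needed to turn a[:i] into b[:j].
    def edit_distance(a: str, b: str) -> int:
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                      # delete all of a[:i]
        for j in range(n + 1):
            dp[0][j] = j                      # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution / match
        return dp[m][n]

    print(edit_distance("kitten", "sitting"))  # 3

Real systems typically avoid running this pairwise over a whole corpus; they index documents by shingles or fingerprints first and only compute distances on candidate matches.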

Max line count of one file? [closed]

What is the maximum number of lines of code one .cs file should hold? Are there any industry standards/best practices?
I don't know the exact figure, but I've heard there is a number that is considered a best practice.
It's all about maintainability, and there's much more to it than the number of lines in a source file. What about number of methods in a class? Cyclomatic complexity of a method? Number of classes in a project?
It's best to run a code-metrics tool, such as NDepend, on your source code to find out more.
As few as possible to do the job required whilst remaining readable.
If you're spilling over into thousands upon thousands of lines, you might want to ask yourself what exactly this file is doing and how you can split it up to express the activity better.
Day to day, if I find a class that is more than 1,000 lines long, I am always suspicious that either the class is responsible for too much or the responsibility is expressed badly.
However, as with every rule of thumb, each case should be assessed on its own merits.

Resources