I wrote a data mining Apriori algorithm. It works well on small test data, but I am having trouble running it on bigger data sets.
I am trying to generate rules for items that were frequently bought together.
My small test data is 5 transactions and 10 products.
My big test data is 11 million transactions and around 2700 products.
Problem: computing min-support and filtering non-frequent items.
Let's imagine we are interested in items whose frequency is 60% or more.
frequency = 0.60;
When I compute min-support for the small data set with 60% frequency (min-support = numberOfTransactions * frequency), the algorithm removes all items that were bought fewer than 3 times.
But when I try to do the same thing for the large data set, the algorithm filters out almost all itemsets after the first iteration; only a couple of items are able to meet such a threshold.
So I started lowering that threshold further and further, running the algorithm many times. But not even 5% gave the desired results. I had to lower my frequency all the way to 0.0005 to get at least 50% of the items involved in the first iteration.
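To make that concrete, here is roughly what the first pass and the min-support computation look like (a minimal Python sketch with made-up item names, not my production code):

```python
from collections import Counter

def frequent_items(transactions, frequency):
    """First Apriori pass: keep single items whose support meets the threshold."""
    min_support = len(transactions) * frequency  # absolute support count
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, count in counts.items() if count >= min_support}

# Example: 5 transactions, frequency 0.60 -> min_support = 3
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
print(frequent_items(transactions, 0.60))  # {'bread', 'milk'}
```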
What do you think about this situation: might it be a data problem, since the data is generated artificially (the Microsoft AdventureWorks sample)?
Or is it a problem with my code or my min-support computation?
Or maybe you can offer some other solution or a better way of doing this?
Thanks!
Maybe that is just what your data is like.
If you have a lot of different items, and few items per transaction, the chances of items co-occurring are low.
Did you verify the result: is it incorrectly pruning, or is the algorithm correct and your parameters simply bad?
Can you actually name an itemset that Apriori pruned but that shouldn't have been pruned?
The problem is, yes, choosing the parameters is hard. And no, apriori cannot use an adaptive threshold, because that wouldn't satisfy the monotonicity requirement. You must use the same threshold for all itemset sizes.
Actually, it all depends on your data. For some real datasets, I had to set the support threshold lower than 0.0002 to get some results. For some other datasets I used 0.9. It really depends on your data.
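One cheap way to see what thresholds are realistic for a given dataset is to look at the distribution of single-item supports before running Apriori at all. A minimal sketch of that check (assuming transactions are represented as a list of item sets, as in the question):

```python
from collections import Counter
import numpy as np

def support_percentiles(transactions):
    """Relative support of each single item, plus a few percentiles
    to show which minimum-support thresholds are realistic for this data."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    supports = np.array(sorted(counts.values(), reverse=True)) / n
    for q in (50, 90, 99, 100):
        print(f"{q}th percentile of item support: {np.percentile(supports, q):.4f}")
    return supports
```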
By the way, there exist variations of Apriori and FP-Growth that can consider multiple minimum supports at the same time, using a different threshold for different items, for example CFP-Growth or MIS-Apriori. There are also algorithms specialized for mining rare itemsets or rare association rules. If you are interested in this topic, you could check my software, which offers some of these algorithms: http://www.philippe-fournier-viger.com/spmf/
I am using t-SNE to make a 2D projection for visualization from a higher dimensional dataset (in this case 30-dims) and I have a question about the perplexity hyperparameter.
It's been a while since I used t-SNE, and I had previously only used it on smaller datasets (<1000 data points), where the advised perplexity of 5-50 (van der Maaten and Hinton) was sufficient to display the underlying data structure.
Currently, I am working with a dataset with 340,000 data points and feel that as the perplexity influences the local vs non-local representation of the data, more data points would require a perplexity much higher than 50 (especially if the data is not highly segregated in the higher dimensional space).
Does anyone have any experience with setting the optimal perplexity on datasets with a larger number of data points (>100k)?
I would be really interested to hear your experiences and which methods you go about using to determine the optimal perplexity (or optimal perplexity range).
An interesting article suggests that the optimal perplexity follows a simple power law (~N^0.5); I would be interested to know what others think about that.
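For context, taking that power law at face value for the dataset above puts the perplexity well outside the classic 5-50 range (a tiny sanity check, where N is just the number of data points):

```python
# Rule-of-thumb perplexity from the ~N^0.5 power law mentioned above
n_points = 340_000           # size of the dataset in question
perplexity = n_points ** 0.5
print(round(perplexity))     # ~583
```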
Thanks for your help
Largely this is empirical, and so I recommend just playing around with values. But I can share my experience...
I had a dataset of about 400k records, each of ~70 dimensions. I ran scikit-learn's implementation of t-SNE with perplexity values of 5, 15, 50, and 100, and I noticed that the clusters looked the same after 50. I gathered that 5-15 was too small, 50 was enough, and increasing the perplexity further didn't make much difference. The run time was a nightmare, though.
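That kind of sweep is straightforward with scikit-learn, if slow at this scale (a sketch; X is assumed to be your (n_samples, n_features) matrix):

```python
from sklearn.manifold import TSNE

# Sweep a few perplexity values and keep each 2-D embedding for plotting.
embeddings = {}
for perplexity in (5, 15, 50, 100):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
```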
The openTSNE implementation is much faster, and offers an interesting guide on how to use smaller and larger perplexity values at different stages of the same run of the algorithm in order to get the advantages of both. Loosely speaking, it initialises the algorithm with a high perplexity for a small number of steps to find the global structure, then continues the optimisation with the lower perplexity.
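I haven't reproduced openTSNE's exact recipe here, but the same idea can be roughly approximated in scikit-learn by feeding a high-perplexity embedding in as the initialisation of a lower-perplexity run (a sketch only; the perplexity values are placeholders, not recommendations):

```python
from sklearn.manifold import TSNE

# Rough two-stage approximation: a high-perplexity run to capture global
# structure, then a low-perplexity run initialised from its output.
coarse = TSNE(n_components=2, perplexity=200, random_state=0).fit_transform(X)
fine = TSNE(n_components=2, perplexity=30, init=coarse, random_state=0).fit_transform(X)
```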
I used this implementation on a dataset of 1.5 million records with ~200 dimensions. The data came from the same source system as the first dataset I mentioned. I didn't play with the perplexity values here because the total runtime on a 32-CPU VM was several hours, but the clusters looked remarkably similar to the ones on the smaller dataset (it recreated binary-classification-esque distinct clusters), so I was happy.
I have a normal assignment problem, where I want to match workers to jobs. But there are several kinds of jobs, each with a set number of positions. So, for example, I would need 10,000 builders, 5,000 welders, etc. Each worker of course has the same preference for every position of the same kind of job.
My current approach is to use the Hungarian algorithm and to just extend the matrix columns to account for that. So, for example, it would have 10,000 builder columns, 5,000 welder columns, etc. Of course, with O(n^3) complexity and a matrix that big, getting results may take a while.
Is there any variation of the Hungarian algorithm, or a different one, which uses the fact that there can be multiple connections to one job node? Or should I rather look into Monte Carlo or genetic search algorithms?
edit:
Formal description as Sascha proposed:
Set $W$ of workers, set $J$ of jobs, weight function $p : W \times J \to \mathbb{R}$ for the preference, and function $c : J \to \mathbb{N}$ for the number of positions available per kind of job.
So the function I want to minimize would be:
$$\min \sum_{w \in W} \sum_{j \in J} p(w, j)\, x_{w,j}$$
where $x_{w,j} \in \{0, 1\}$ indicates whether worker $w$ is assigned to job $j$.
Constraints would be:
$$\sum_{j \in J} x_{w,j} = 1 \quad \text{for every } w \in W$$
and
$$\sum_{w \in W} x_{w,j} \le c(j) \quad \text{for every } j \in J.$$
As asked by Yay295, it would be OK if it ran for a day or two on a normal consumer machine. There are 50k workers right now, with 10 kinds of jobs and 50k jobs total. So the matrix is 50k x 50k (extended) in the case of the Hungarian algorithm I'm using right now, or 50k x 10 for the LP with the additional constraint $\sum_{w \in W} x_{w,j} \le c(j)$, while $\sum_{j \in J} c(j) = 50\text{k}$, and the preference values in the matrix would go from 0 to 100.
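For illustration, the "extended columns" approach I'm currently using looks roughly like this at a toy scale (scipy's linear_sum_assignment standing in for my Hungarian implementation; the numbers are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy instance: 6 workers, 3 job kinds with capacities 3, 2, 1.
rng = np.random.default_rng(0)
preference = rng.integers(0, 101, size=(6, 3))   # preference values 0-100
capacity = np.array([3, 2, 1])

# Duplicate each job-kind column once per open position (the "extended" matrix).
extended = np.repeat(preference, capacity, axis=1)           # shape (6, 6)
job_of_column = np.repeat(np.arange(len(capacity)), capacity)

rows, cols = linear_sum_assignment(extended)                 # minimises total preference
assignment = {worker: job_of_column[col] for worker, col in zip(rows, cols)}
print(assignment)
```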
This is actually called the Transportation Problem. The Transportation Problem is similar to the Assignment Problem in that they both have sources and destinations, but the Transportation Problem has two more values: each source has a supply, and each destination has a demand. The Assignment Problem is a simplification of the Transportation Problem in which the supply of each source and the demand of each destination is 1.
In your case, you have 50,000 sources (your workers) each with a supply of 1 (each worker can only work one job). You also have 10 destinations (the job types) each with some amount of demand (the number of openings for that type).
The Transportation Problem is traditionally solved with the Simplex Algorithm. I couldn't tell you how it works off the top of my head, but there is plenty of information available elsewhere online on how to do it. I would recommend these two videos: first, second.
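If you don't want to implement Simplex yourself, the same transportation LP can be handed to an off-the-shelf solver. A minimal sketch with scipy.optimize.linprog at toy sizes (the instance data is made up; the variables x[w][j] are flattened row-major, and the solution is integral because the constraint matrix is totally unimodular):

```python
import numpy as np
from scipy.optimize import linprog

n_workers, n_jobs = 6, 3
rng = np.random.default_rng(1)
cost = rng.integers(0, 101, size=(n_workers, n_jobs))   # preference to minimise
demand = np.array([3, 2, 1])                             # positions per job kind

# Equality constraints: each worker takes exactly one job.
A_eq = np.zeros((n_workers, n_workers * n_jobs))
for w in range(n_workers):
    A_eq[w, w * n_jobs:(w + 1) * n_jobs] = 1
b_eq = np.ones(n_workers)

# Inequality constraints: each job kind gets at most its number of positions.
A_ub = np.zeros((n_jobs, n_workers * n_jobs))
for j in range(n_jobs):
    A_ub[j, j::n_jobs] = 1
b_ub = demand

res = linprog(cost.ravel(), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
x = res.x.reshape(n_workers, n_jobs)
print(np.argmax(x, axis=1))   # job kind chosen for each worker
```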
Alternatively, the Transportation Problem can actually also be solved using the Hungarian Algorithm. The idea is to keep track of your supply and demand separately, and then use the Hungarian Algorithm (or any other algorithm for the Assignment Problem) to solve it as if the supply and demand were 1 (this can be incredibly fast when it's as lopsided as 50,000 sources to 10 destinations as in your case). Once you've solved it once, use the results to decrement the supply and demand of the assigned solution appropriately. Repeat until the sum of either supply or demand is zero.
However, none of this may be necessary. I wrote my own Assignment Problem solver in C++ a few years ago, and despite using 2.5GB of RAM, it can solve a 50,000 by 50,000 assignment problem in less than 5 seconds. The trick is to write your own. Before I wrote mine I had a look around at what was available online, and they were all pretty bad, often with obvious bugs. If you are going to write your own code for this though, it would be better to use the Simplex Algorithm as described in the videos I linked above. I don't know that one is faster than the other, but the Hungarian Algorithm wasn't made for the Transportation Problem.
ps: The same person who did the two lectures I linked above also did one on the Assignment Problem and the Hungarian Algorithm.
I am working with a sample data set to learn clustering. This data set contains the number of occurrences of each keyword.
Since all of the values are occurrence counts for the different keywords, is it OK not to scale them and use them as they are?
I read a couple of articles on the internet where it is emphasized that scaling is important, as it adjusts the relative weight of the frequencies. Since most of the frequencies are 0 (95%+), z-score scaling will change the shape of the distribution, which I feel could be a problem, as I am changing the nature of the data.
I am thinking of not changing the values at all to avoid this. Will that affect the quality of the results I get from the clustering?
As it was already noted, the answer heavily depends on an algorithm being used.
If you're using distance-based algorithms with (usually default) Euclidean distance (for example, k-Means or k-NN), it'll rely more on features with bigger range just because a "typical difference" of values of that feature is bigger.
Non-distance-based models can be affected, too. One might think that linear models do not fall into this category, since scaling (and translating, if needed) is a linear transformation, so if it makes results better, then the model should learn it, right? It turns out the answer is no. The reason is that no one uses vanilla linear models; they're always used with some sort of regularization which penalizes overly large weights. This can prevent your linear model from learning the scaling from the data.
There are models that are independent of the feature scale. For example, tree-based algorithms (decision trees and random forests) are not affected. A node of a tree partitions your data into 2 sets by comparing a feature (the one that splits the dataset best) to a threshold value. There's no regularization on the threshold itself (regularization in trees is about keeping the height small), so it's not affected by different scales.
That being said, it's usually advised to standardize (subtract mean and divide by standard deviation) your data.
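For example, with scikit-learn the standardize-then-cluster pipeline is just a couple of lines (a sketch; X is assumed to be your keyword-count matrix as a dense array, and k-means is just one possible choice of algorithm):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # subtract mean, divide by std
labels = KMeans(n_clusters=8, random_state=0).fit_predict(X_scaled)
```

Running the same k-means on the raw X and comparing the two labelings is a cheap way to see how much scaling actually changes the result on your data.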
Probably it depends on the classification algorithm. I'm only familiar with SVMs. Please see Ch. 2.2 for an explanation of scaling.
The type of feature (count of words) doesn't matter. The feature ranges should be more or less similar. If the count of, e.g., "dignity" is 10 and the count of "have" is 100000000 in your texts, then (at least with an SVM) the results with such features would be less accurate than if you scaled both counts to a similar range.
The cases where no scaling is needed are those where the data is scaled implicitly, e.g. when the features are pixel values in an image; the data is already scaled to the range 0-255.
* Distance-based algorithms need scaling.
* There is no need for scaling in tree-based algorithms.
But it is good to scale your data and train the model both ways if possible: compare the model accuracy and other evaluations before and after scaling and use the better option.
This is as per my knowledge.
I have about 44 Million training examples across about 6200 categories.
After training, the model comes out to be ~ 450MB
And while testing, with 5 parallel mappers (each given enough RAM), the classification proceeds at a rate of ~ 4 items a second which is WAY too slow.
How can I speed things up?
One way I can think of is to reduce the word corpus, but I fear losing accuracy. I had maxDFPercent set to 80.
Another way i thought of was to run the items through a clustering algorithm and empirically maximize the number of clusters while keeping the items within each category restricted to a single cluster. This would allow me to build separate models for each cluster and thereby (possibly) decrease training and testing time.
Any other thoughts?
Edit :
After some of the answers given below, I started contemplating doing some form of down-sampling by running a clustering algorithm, identifying groups of items that are "highly" close to one another, and then taking a union of a few samples from those "highly" close groups and other samples that are not that tightly close to one another.
I also started thinking about using some form of data normalization technique that involves incorporating edit distances while using n-grams (http://lucene.apache.org/core/4_1_0/suggest/org/apache/lucene/search/spell/NGramDistance.html)
I'm also considering using the Hadoop Streaming API to leverage some of the ML libraries available in Python, listed here http://pydata.org/downloads/ and here http://scikit-learn.org/stable/modules/svm.html#svm (these, I think, use the liblinear mentioned in one of the answers below).
Prune stopwords and otherwise useless words (too low support etc.) as early as possible.
Depending on how you use clustering, it may actually make the test phase in particular even more expensive.
Try other tools than Mahout. I found Mahout to be really slow in comparison. It seems to come with a really high overhead somewhere.
Using fewer training examples would be an option. You will see that after a certain number of training examples, your classification accuracy on unseen examples won't increase any further. I would recommend trying to train with 100, 500, 1000, 5000, ... examples per category, using 20% for cross-validating the accuracy. When it stops increasing, you have found the amount of data you need, which may be a lot less than you use now.
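A rough sketch of that kind of learning curve in scikit-learn (sampling overall rather than per category for brevity; X and y are assumed to be your full feature matrix and label array, the sample sizes are illustrative, and MultinomialNB stands in for whatever Naive Bayes implementation you use):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for n in (10_000, 50_000, 100_000, 500_000, 1_000_000):
    idx = np.random.RandomState(0).choice(len(y_train), size=n, replace=False)
    model = MultinomialNB().fit(X_train[idx], y_train[idx])
    print(n, model.score(X_val, y_val))   # stop growing n once this flattens out
```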
Another approach would be to use another library. For document classification I find liblinear very, very fast. It may be more low-level than Mahout, though.
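liblinear is also what scikit-learn's LinearSVC wraps, so one low-effort way to try it (a sketch, assuming docs is a list of raw document strings and labels the category ids):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# max_df mirrors maxDFPercent=80; min_df prunes very rare terms.
vectorizer = TfidfVectorizer(max_df=0.8, min_df=5)
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)
```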
"but i fear losing accuracy" Have you actually tried using less features or less documents? You may not lose as much accuracy as you fear. There may be a few things at play here:
Such a high number of documents are not likely to be from the same time period. Over time, the content of a stream will inevitably drift and words indicative of one class may become indicative of another. In a way, adding data from this year to a classifier trained on last year's data is just confusing it. You may get much better performance if you train on less data.
The majority of features are not helpful, as #Anony-Mousse said already. You might want to perform some form of feature selection before you train your classifier. This will also speed up training. I've had good results in the past with mutual information.
I've previously trained classifiers for a data set of similar scale and found the system worked best with only 200k features, and using any more than 10% of the data for training did not improve accuracy at all.
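In scikit-learn terms, that kind of feature selection is a couple of lines (a sketch; X and y as above, and k=200000 is just the value that happened to work for my dataset):

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep only the most informative features before training the classifier.
# mutual_info_classif is slow on very wide matrices; chi2 is a faster drop-in score_func.
selector = SelectKBest(score_func=mutual_info_classif, k=200_000)
X_reduced = selector.fit_transform(X, y)
```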
PS Could you tell us a bit more about your problem and data set?
Edit after question was updated:
Clustering is a good way of selecting representative documents, but it will take a long time. You will also have to re-run it periodically as new data come in.
I don't think edit distance is the way to go. Typical algorithms are quadratic in the length of the input strings, and you might have to run it for each pair of words in the corpus. That's a long time!
I would again suggest that you give random sampling a shot. You say you are concerned about accuracy, but are using Naive Bayes. If you wanted the best model money can buy, you would go for a non-linear SVM, and you probably wouldn't live to see it finish training. People resort to classifiers with known issues (there's a reason Naive Bayes is called Naive) because they are much faster than the alternative but performance will often be just a tiny bit worse. Let me give you an example from my experience:
* RBF SVM: 85% F1 score, training time ~1 month
* Linear SVM: 83% F1 score, training time ~1 day
* Naive Bayes: 82% F1 score, training time ~1 day
You find the same thing in the literature: paper. Out of curiosity, what kind of accuracy are you getting?
My understanding of bootstrapping is that you:
1. Build a "tree" using some algorithm from a matrix of sequences (nucleotides, let's say).
2. Store that tree.
3. Perturb the matrix from step 1, and rebuild the tree.
My question is: what is the purpose of step 3 from a sequence bioinformatics perspective? My guess is that, by changing characters in the original matrix, you can remove artifacts in the data. But I have a problem with that guess: I am not sure why removal of such artifacts is necessary. A sequence alignment is supposed to deal with artifacts by finding long stretches of similarity, by its very nature.
Bootstrapping, in phylogenetics as elsewhere, doesn't improve the quality of whatever you're trying to estimate (a tree in this case). What it does do is give you an idea of how confident you can be about the result you get from your original dataset. A bootstrap analysis answers the question "If I repeated this experiment many times, using a different sample each time (but of the same size), how often would I expect to get the same result?" This is usually broken down by edge ("How often would I expect to see this particular edge in the inferred tree?").
Sampling Error
More precisely, bootstrapping is a way of approximately measuring the expected level of sampling error in your estimate. Most evolutionary models have the property that, if your dataset had an infinite number of sites, you would be guaranteed to recover the correct tree and correct branch lengths*. But with a finite number of sites this guarantee disappears. What you infer in these circumstances can be considered to be the correct tree plus sampling error, where the sampling error tends to decrease as you increase the sample size (number of sites). What we want to know is how much sampling error we should expect for each edge, given that we have (say) 1000 sites.
What We Would Like To Do, But Can't
Suppose you used an alignment of 1000 sites to infer the original tree. If you somehow had the ability to sequence as many sites as you wanted for all your taxa, you could extract another 1000 sites from each and perform this tree inference again, in which case you would probably get a tree that was similar but slightly different to the original tree. You could do this again and again, using a fresh batch of 1000 sites each time; if you did this many times, you would produce a distribution of trees as a result. This is called the sampling distribution of the estimate. In general it will have highest density near the true tree. Also it becomes more concentrated around the true tree if you increase the sample size (number of sites).
What does this distribution tell us? It tells us how likely it is that any given sample of 1000 sites generated by this evolutionary process (tree + branch lengths + other parameters) will actually give us the true tree -- in other words, how confident we can be about our original analysis. As I mentioned above, this probability-of-getting-the-right-answer can be broken down by edge -- that's what "bootstrap probabilities" are.
What We Can Do Instead
We don't actually have the ability to magically generate as many alignment columns as we want, but we can "pretend" that we do, by simply regarding the original set of 1000 sites as a pool of sites from which we draw a fresh batch of 1000 sites with repetition for each replicate. This generally produces a distribution of results that is different from the true 1000-site sampling distribution, but for large site counts the approximation is good.
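Mechanically, a bootstrap replicate is nothing more than resampling alignment columns with replacement; a minimal sketch (the alignment here is a made-up taxa-by-sites character matrix):

```python
import numpy as np

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (sites) with replacement."""
    n_sites = alignment.shape[1]
    cols = rng.integers(0, n_sites, size=n_sites)   # sample n_sites column indices
    return alignment[:, cols]

rng = np.random.default_rng(0)
alignment = np.array([list("ACGTACGTAC"),
                      list("ACGTACGTTC"),
                      list("ACGAACGTAC")])
replicate = bootstrap_replicate(alignment, rng)
# Each replicate is then fed to the same tree-inference method, and the
# per-edge frequencies across replicates give the bootstrap support values.
```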
* That is assuming that the dataset was in fact generated according to this model -- which is something that we cannot know for certain, unless we're doing a simulation. Also some models, like uncorrected parsimony, actually have the paradoxical quality that under some conditions, the more sites you have, the lower the probability of recovering the correct tree!
Bootstrapping is a general statistical technique that has applications outside of bioinformatics. It is a flexible means of coping with small samples, or samples from a complex population (which I imagine is the case in your application.)