Frequent itemset mining: best algorithm and library for performance

I started working on machine learning projects a few days ago and I have the following situation:
I have a database of itineraries (an itinerary is a set of destinations that were selected together as part of a trip) and I want to identify whether a destination is going to be selected as part of a trip given other selected destinations. Here is an example, given that A, B, C, D are destinations:
A, B -> C
A, D, C -> B
I think this is a recommender system problem and I studied techniques to approach a solution.
I tried using WEKA's Apriori and FPGrowth, but I have not been able to generate a result. I have 91 items and 12,000 transactions (i.e. an ARFF file with 91 columns and 12,000 rows of TRUE and FALSE values), and the program never finishes even though it never uses more than 5 GB of RAM (I waited 30 hours for the algorithm to run on a latest-generation Core i7 PC with 12 GB of RAM). Also, I don't see any option to select only the rules that have a TRUE value as the consequent (I need this because I want to see whether someone will travel to X given that some other people travel to Y).
So, are there any other techniques or approaches that can be used to achieve the result I am expecting? I want the output to be a file with the "rules", i.e. the sets of items that "imply" another set of items, together with the probability of each "recommendation".
Example:
A, B -> C ; 90%
verbose: "People who travel to Rome and Florence travel to Milan with a probability (or other measure) of 90%"
Thanks!

Something about your implementation of the Apriori algorithm doesn't seem right. Try another implementation of Apriori, or check the current one. For the stated purpose of generating association rules between the destinations, Apriori or the faster FP-Growth algorithm is just fine. Maybe this helps for a general understanding: R - association rules - apriori

Actually, the implementation in Weka is quite inefficient. You could check the SPMF data mining library in Java, which offers efficient implementations of algorithms for pattern mining. It actually has more than 100 algorithms, including Apriori, FPGrowth and many others. I would recommend using FPGrowth, which is very fast and memory efficient, but you could also check the other algorithms. By the way, I am the founder of the library.
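For a quick sanity check outside Weka or SPMF, here is a minimal sketch (my own illustration, not from either answer) using the mlxtend library in Python on a one-hot transaction table like the 91-column file described in the question; it also shows how to keep only rules whose consequent is a single destination:

# Minimal sketch using mlxtend, assuming `df` is a 12,000 x 91 boolean
# DataFrame: one row per itinerary, one column per destination.
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

df = pd.DataFrame({"Rome":     [True, True, False],   # toy data; replace with the real table
                   "Florence": [True, True, True],
                   "Milan":    [True, False, True]})

frequent = fpgrowth(df, min_support=0.05, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

# Keep only rules that predict a single destination ("X, Y -> Z ; confidence").
rules = rules[rules["consequents"].apply(len) == 1]
print(rules[["antecedents", "consequents", "confidence"]])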

Related

Contextual-Bandit Approach: Algorithm 1 LinUCB with disjoint linear models

I am trying to implement the algorithm called LinUCB with disjoint linear models from this paper "A Contextual-Bandit Approach to Personalized News Article Recommendation" http://rob.schapire.net/papers/www10.pdf
This is the algorithm:
Algorithm 1 LinUCB with disjoint linear models
I am confused about the feature vector x_{t,a} (which I highlighted in the algorithm).
Is the feature vector related to information (context) of the article(arm) or the user?
I would appreciate your help.
Thank you
The feature vector x_{t,a} applies to both the user and the arm. As the paper puts it:
The vector x_{t,a} summarizes information of both the user u_t and arm a,
and will be referred to as the context.
In the most general case, the feature vector x_{t,a} is allowed to be a function of both the user context c_t and the arm a, i.e. x_{t,a} = phi(c_t, a). Note that this could simply be a subset: each arm might have different features it uses to predict an outcome, or in other words x_{t,a} is a subset of c_t.
For example, if a movie recommendation website is deciding which movie to recommend, they might need different information from the user when attempting to predict if he'll like sci-fi movies versus drama movies. This different information is reflected in the fact that features are allowed to vary by arm.
Alternatively, it might be the case that x_{t,a} is the same for all a, i.e. x_{t,a} = x_t. For example, when trying to learn the best medical dosage, the algorithm might want to know height, weight, and age for all the patients. In this case the features would not vary by arm.
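To make the role of x_{t,a} concrete, here is a small numpy sketch of Algorithm 1 with disjoint models; each arm keeps its own A_a and b_a, and the caller supplies one feature vector per arm (how you build x_{t,a} from the user context and the arm is up to you and is not shown here):

# Sketch of LinUCB with disjoint linear models (Algorithm 1 of the paper).
import numpy as np

class LinUCBDisjoint:
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = 0

    def choose(self, x):
        """x has shape (n_arms, d); row a is the feature vector x_{t,a}."""
        scores = []
        for A_a, b_a, x_a in zip(self.A, self.b, x):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                        # theta_a = A_a^{-1} b_a
            scores.append(theta @ x_a + self.alpha * np.sqrt(x_a @ A_inv @ x_a))
        return int(np.argmax(scores))                  # arm with the highest UCB score

    def update(self, a, x_a, reward):
        self.A[a] += np.outer(x_a, x_a)                # A_a += x x^T
        self.b[a] += reward * x_a                      # b_a += r x

In the news setting of the paper, x_{t,a} mixes user features with features of article a, so the rows of x differ per arm; in the dosage example above, every row would be the same patient vector.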

Need a better explanation of Communication Cost Model for MapReduce than in MMDS

I was going through the MMDS book, which has an online MOOC of the same name. I'm having trouble understanding the Communication Cost Model and the join-operation calculations in Section 2.5, and I'm surprised by how poorly organized the book is, since the MOOC covers the same topic under "Advanced Topics/Computation Complexity of MapReduce" at the end of the course.
There's an exercise question (the example did not help at all) that goes like this:
We wish to take the join R(A,B) |><| S(B,C) |><| T(A,C) as a single MapReduce process, in a way that minimizes the communication cost. We shall use 512 Reduce tasks, and the sizes of relations R, S, and T are 2^20 = 1,048,576, 2^17 = 131,072, and 2^14 = 16,384, respectively. Compute the number of buckets into which each of the attributes A, B, and C are to be hashed. Then, determine the number of times each tuple of R, S, and T is replicated by the Map function.
Could you walk me through it? I don't understand how the author jumps from the simple r + s + t communication cost to Lagrange multipliers without spelling out the intermediate steps.
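For what it's worth, here is my own working of the standard derivation from Section 2.5.3 (double-check it against the book): with shares a, b, c for attributes A, B, C and a*b*c = k reducers, each R-tuple is replicated c times, each S-tuple a times, and each T-tuple b times, so you minimize c*r + a*s + b*t subject to a*b*c = k. The Lagrange-multiplier conditions are r = lambda*a*b, s = lambda*b*c, t = lambda*a*c; multiplying them gives lambda^3 = r*s*t / k^2.

# My own working (verify against MMDS Section 2.5.3): shares a, b, c for the
# three-way join with k reducers, minimizing c*r + a*s + b*t s.t. a*b*c = k.
r, s, t, k = 2**20, 2**17, 2**14, 512

lam = (r * s * t / k**2) ** (1 / 3)   # lambda = 2**11
a = round(lam * k / s)                # buckets for A: 8
b = round(lam * k / t)                # buckets for B: 64
c = round(lam * k / r)                # buckets for C: 1

print(a, b, c, a * b * c)             # 8 64 1 512
# Each R-tuple is then replicated c = 1 time, each S-tuple a = 8 times,
# and each T-tuple b = 64 times by the Map function.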

Effective clustering of a similarity matrix

My topic is similarity and clustering of (a bunch of) texts. In a nutshell: I want to cluster collected texts together, and they should end up in meaningful clusters. My approach up to now is as follows; my problem is in the clustering. The current software is written in PHP.
1) Similarity:
I treat every document as a "bag-of-words" and convert words into vectors. I use
filtering (only "real" words)
tokenization (split sentences into words)
stemming (reduce words to their base form; Porter's stemmer)
pruning (cut off words with too high or too low frequency)
as methods for dimensionality reduction. After that, I'm using cosine similarity (as suggested / described on various sites on the web and here).
The result then is a similarity matrix like this:
      A    B    C    D    E
A     0   30   51   75   80
B     X    0   21   55   70
C     X    X    0   25   10
D     X    X    X    0   15
E     X    X    X    X    0
A…E are my texts and the number is the similarity in percent; the higher, the more similar the texts are. Because sim(A,B) == sim(B,A), only half of the matrix is filled in. So, for example, the similarity of Text A to Text D is 75%.
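For reference, such a matrix can be produced in a few lines of Python (my sketch, with placeholder documents; the actual pipeline here is PHP, and a Porter stemmer such as nltk's could be plugged in through the tokenizer argument):

# Bag-of-words vectors with frequency pruning, then pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat sat on the mat",        # placeholder documents
         "a cat and a small dog",
         "dogs do not like cats"]

vec = TfidfVectorizer(stop_words="english", min_df=1, max_df=0.9)  # min_df/max_df implement the pruning step
X = vec.fit_transform(texts)
sim = cosine_similarity(X) * 100          # similarity in percent, as in the table above
print(sim.round())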
I want to generate an a priori unknown(!) number of clusters out of this matrix. The clusters should group the similar items together (up to a certain stopping criterion).
I tried a basic implementation myself, which was basically like this (with 60% as a fixed similarity threshold, $sim being the full symmetric similarity matrix):
$clusters = array(); $nextCluster = 0;
foreach ($articles as $i => $article) {
    $similar = array_keys(array_filter($sim[$i], function ($s) { return $s > 60; }));
    $found = null;
    foreach ($similar as $j) {
        if (isset($clusters[$j])) { $found = $clusters[$j]; break; }   // reuse an existing cluster number
    }
    $label = ($found !== null) ? $found : $nextCluster++;              // otherwise assign a new one
    foreach (array_merge(array($i), $similar) as $j) {
        $clusters[$j] = $label;
    }
}
It worked (somehow), but wasn't good at all, and the results were often monster clusters.
So I want to redo this, and I have already looked into all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. But the open questions are what the stopping criterion should be and whether the algorithm should split and/or merge existing clusters.
Sorry if some of this seems basic, but I am relatively new to this field. Thanks for the help.
Since you're new to the field, have an unknown number of clusters, and are already using cosine distance, I would recommend the FLAME clustering algorithm.
It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).
Not to mention, it's actually good enough to be used in research by a large number of people. If nothing else, you'll get an idea of exactly which shortcomings of this clustering algorithm you want to address when moving on to another one.
Just try some. There are so many clustering algorithms out there that nobody knows all of them. Plus, a lot depends on your data set and the clustering structure that is actually in it.
In the end, there may also be just one monster cluster with respect to cosine distance and bag-of-words features.
Maybe you can transform your similarity matrix into a dissimilarity matrix, e.g. by mapping x to 1/x; then your problem becomes clustering a dissimilarity matrix. I think hierarchical clustering may work. These may help you: hierarchical clustering and Clustering a dissimilarity matrix
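Following up on the hierarchical-clustering suggestion, here is a minimal Python sketch (my own illustration, not part of the answers above) that turns the percent similarities from the question into dissimilarities via 1 - sim/100 and cuts the dendrogram at a distance threshold, so the number of clusters is not fixed in advance:

# Average-linkage clustering of a precomputed dissimilarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

sim = np.array([[100,  30,  51,  75,  80],        # symmetric version of the matrix above,
                [ 30, 100,  21,  55,  70],        # with 100 on the diagonal
                [ 51,  21, 100,  25,  10],
                [ 75,  55,  25, 100,  15],
                [ 80,  70,  10,  15, 100]], dtype=float)

dist = 1.0 - sim / 100.0                          # dissimilarity = 1 - similarity
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.4, criterion="distance") # 0.4 distance ~ 60% similarity cutoff
print(labels)                                     # cluster id per text A..E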

Looking for a model to represent this problem, which I suspect may be NP-complete

(I've changed the details of this question to avoid NDA issues. I'm aware that if taken literally, there are better ways to run this theoretical company.)
There is a group of warehouses, each of which is capable of storing and distributing 200 different products out of the 1,000 total products that Company A manufactures. Each warehouse is stocked with 200 products and is assigned orders, which it then fills from its stock on hand.
The challenge is that each warehouse needs to be self-sufficient. There will be an order for an arbitrary number of products (5-10 usually), which is assigned to a warehouse. The warehouse then packs the required products for the order, and ships them together. For any item which isn't available in the warehouse, the item must be delivered individually to the warehouse before the order can be shipped.
So, the problem lies in determining the best warehouse/product configurations so that the largest possible number of orders can be packed without having to order and wait for individual items.
For example (using products each represented by a letter, and warehouses capable of stocking 5 product lines):
Warehouse 1: [A, B, C, D, E]
Warehouse 2: [A, D, F, G, H]
Order: [A, C, D] -> Warehouse 1
Order: [A, D, H] -> Warehouse 2
Order: [A, B, E, F] -> Warehouse 1 (+1 separately ordered)
Order: [A, D, E, F] -> Warehouse 2 (+1 separately ordered)
The goal is to use historical data to minimize the number of individually ordered products in future. Once the warehouses had been set up a certain way, the software would just determine which warehouse could handle an order with minimal overhead.
This immediately strikes me as a machine learning style problem. It also seems like a combination of certain well known NP-Complete problems, though none of them seem to fit properly.
Is there a model which represents this type of problem?
If I understand correctly, you have two separate problems:
Predict what should each warehouse pre-buy
Get the best warehouse for an order
For the first problem, I point you to the Netflix Prize: it was almost the same problem, and great solutions have been proposed. (My data-mining handbook is at home and I can't remember the precise keyword to google, sorry. Try "data mining time series".)
For the second one, this is a problem for Prolog.
Set a cost for separately ordering an item
Set a cost for, say, proximity to the customer
Set the cost for already owning the product to 0
Make the rule to get a product: buy it if you don't have it, take it from stock if you do
Make the rule to get all products: for each product, apply the rule above
Get the total cost of that rule
Gently ask Prolog for a solution. If it's not good enough, ask for more.
If you don't want to use Prolog, there are several constraints libraries out there. Just google "constraint library <insert your programming language here>"
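If Prolog or a constraint library feels heavyweight, the scoring idea can be sketched in a few lines of Python (my own illustration, using only the "missing items" cost and ignoring proximity to the customer):

# Pick the warehouse that needs the fewest separately ordered items for an order.
def best_warehouse(order, warehouses):
    """warehouses: dict mapping warehouse name -> set of stocked products."""
    def missing(stock):
        return len(set(order) - stock)
    name = min(warehouses, key=lambda w: missing(warehouses[w]))
    return name, missing(warehouses[name])

warehouses = {"W1": {"A", "B", "C", "D", "E"},
              "W2": {"A", "D", "F", "G", "H"}}
print(best_warehouse(["A", "B", "E", "F"], warehouses))   # ('W1', 1), matching the example above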
The first part of the problem (which items are frequently ordered together) is sometimes known as the co-occurrence problem, and is a big part of the data mining literature. (My recollection is that the problem is in NP, but there exist quite good approximate algorithms).
Once you have co-occurrence data you are happy with, you are still left with the assignment of items to warehouses. It's a little like the set-covering problem, but not quite the same. This problem is NP-hard.

Time Aware Social Graph DS/Queries

Classic social networks can be represented as a graph/matrix.
With a graph/matrix one can easily compute
shortest path between 2 participants
reachability from A -> B
general statistics (reciprocity, avg connectivity, etc)
etc
Is there an ideal data structure (or a modification to graph/matrix) that enables easy computation of the above while being time aware?
For example,
Input
t = 0...100
A <-> B (while t = 0...10)
B <-> C (while t = 5...100)
C <-> A (while t = 50...100)
Sample Queries
Is A associated with B at any time? (yes)
Is A associated with B while B is associated with C? (yes. #t = 5...10)
Is C ever reachable from A (yes. # t=5 )
What you're looking for is an explicitly persistent data structure. There's a fair body of literature on this, but it's not that well known. Chris Okasaki wrote a pretty substantial book on the topic. Have a look at my answer to this question.
Given a full implementation of something like Driscoll et al.'s node-splitting structure, there are a few different ways to set up your queries. If you want to know about things that were true in a particular time range, you would only examine nodes containing data about that range. If you wanted to know in what time range something was true, you would start searching and progressively tighten your bounds as you explore each new node. Just remember that your results might not always be contiguous - consider two people who start dating, break up, and get back together.
I would guess that there's probably at least one publication worth of unexplored territory in how to do interesting queries over persistent graphs, if not much more.
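Short of implementing Driscoll et al.'s node-splitting structure, a much simpler starting point (my own sketch, and far less efficient on large graphs) is an interval-labelled edge list: each undirected edge carries the time interval during which it is active, and queries work on a snapshot of the graph at time t:

# Interval-labelled edges and snapshot reachability queries.
from collections import defaultdict

edges = [("A", "B", 0, 10), ("B", "C", 5, 100), ("C", "A", 50, 100)]

def neighbors_at(t):
    adj = defaultdict(set)
    for u, v, t0, t1 in edges:
        if t0 <= t <= t1:                 # edge active at time t
            adj[u].add(v)
            adj[v].add(u)
    return adj

def reachable_at(src, dst, t):
    adj, seen, stack = neighbors_at(t), {src}, [src]
    while stack:                          # depth-first search over the snapshot at t
        u = stack.pop()
        if u == dst:
            return True
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return False

print(reachable_at("A", "C", 5))                               # True: A-B and B-C overlap at t = 5
print(any(reachable_at("A", "C", t) for t in range(0, 101)))   # is C ever reachable from A?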
