So a recent question made me aware of the rather cool apriori algorithm. I can see why it works, but what I'm not sure about is practical uses. Presumably the main reason to compute related sets of items is to be able to provide recommendations for someone based on their own purchases (or owned items, etcetera). But how do you go from a set of related sets of items to individual recommendations?
The Wikipedia article finishes:
The second problem is to generate association rules from those large itemsets with the constraints of minimal confidence. Suppose one of the large itemsets is Lk, Lk = {I1, I2, … , Ik}, association rules with this itemsets are generated in the following way: the first rule is {I1, I2, … , Ik-1} ⇒ {Ik}, by checking the confidence this rule can be determined as interesting or not. Then other rule are generated by deleting the last items in the antecedent and inserting it to the consequent, further the confidences of the new rules are checked to determine the interestingness of them. Those processes iterated until the antecedent becomes empty.
I'm not sure how the set of association rules helps in determining the best set of recommendations either, though. Perhaps I'm missing the point, and apriori is not intended for this use? In which case, what is it intended for?
So the apriori algorithm is no longer the state of the art for Market Basket Analysis (aka Association Rule Mining). The techniques have improved, though the Apriori principle (that the support of a subset upper bounds the support of the set) is still a driving force.
In any case, the way association rules are used to generate recommendations is that, given some history itemset, we can check each rule's antecedent to see if it is contained in the history. If so, we can recommend the rule's consequent (excluding cases where the consequent is already contained in the history, of course).
We can use various metrics to rank our recommendations, since with a multitude of rules we may get many hits when comparing them to a history, and we can only make a limited number of recommendations. Some useful metrics are the support of a rule (which is the same as the support of the union of the antecedent and the consequent), the confidence of a rule (the support of the rule over the support of the antecedent), and the lift of a rule (the support of the rule over the product of the supports of the antecedent and the consequent), among others.
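To make that concrete, here is a minimal pure-Python sketch of the matching-and-ranking step described above. The toy transactions, the hand-picked rules and the example history are all invented for illustration; in a real system the rules would come out of an Apriori implementation instead.

```python
# Toy transaction database (made-up data for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

def rule_metrics(antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    sup_rule = support(antecedent | consequent)
    conf = sup_rule / support(antecedent)
    lift = conf / support(consequent)
    return sup_rule, conf, lift

# Pretend these rules came out of Apriori (they are hand-picked here).
rules = [
    (frozenset({"bread", "milk"}), frozenset({"diapers"})),
    (frozenset({"diapers", "beer"}), frozenset({"bread"})),
    (frozenset({"milk"}), frozenset({"cola"})),
]

def recommend(history, rules, top_k=2):
    """Recommend consequents of rules whose antecedent is contained in `history`."""
    hits = []
    for antecedent, consequent in rules:
        if antecedent <= history and not (consequent <= history):
            sup, conf, lift = rule_metrics(antecedent, consequent)
            hits.append((conf, lift, sup, consequent))
    # Rank by confidence, then lift, then support.
    hits.sort(key=lambda h: h[:3], reverse=True)
    return [consequent for _, _, _, consequent in hits[:top_k]]

print(recommend({"bread", "milk"}, rules))
```

Running this recommends the consequents of the rules whose antecedents sit inside the history, ranked by confidence; swapping the sort key for lift or support is a one-line change.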
If you want some details about how Apriori can be used for classification, you could read the paper on the CBA algorithm:
Bing Liu, Wynne Hsu, Yiming Ma, "Integrating Classification and Association Rule Mining." Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98, Plenary Presentation), New York, USA, 1998
The Rete algorithm is an efficient pattern matching algorithm that compares a large collection of patterns to a large collection of objects. It is also used in one of the expert system shells that I am exploring right now: Drools.
What is the time complexity of the algorithm, based on the number of rules I have?
Here is a link for Rete Algorithm: http://www.balasubramanyamlanka.com/rete-algorithm/
Also for Drools: https://drools.org/
Estimating the complexity of RETE is a non-trivial problem.
Firstly, you cannot use the number of rules as a dimension. What you should look at are the individual constraints or matches the rules contain. You can see a rule as a collection of constraints grouped together; this is all that RETE reasons about.
Once you have a rough estimate of the number of constraints your rule base has, you will need to look at those which are inter-dependent. Inter-dependent constraints are the most complex matches and are similar in concept to JOINs in SQL queries. Their complexity varies based on their nature as well as on the state of your working memory.
Then you will need to look at the size of your working memory. The number of facts you assert within a RETE-based expert system strongly influences its performance.
Lastly, you need to consider the engine conflict resolution strategy. If you have several conflicting rules, it might take a lot of time to figure out in which order to execute them.
Regarding RETE performance, there is a very good PhD dissertation I'd suggest you look at: Robert B. Doorenbos, "Production Matching for Large Learning Systems".
I have a question for which I have made some solutions, but I am not happy with their scalability. I'm looking for input on some different approaches/algorithms for solving it.
Problem:
Software can run on electronic controllers (ECUs) and requires different resources to run a given feature. It may require a given amount of storage or RAM, or a digital or analog input or output, for instance. If we have multiple features and multiple controller options, we want to find the combination that minimizes the hardware requirements (cost). I'll reduce the resources to letters to make the examples easier to follow.
Example 1:
Feature1(A)
ECU1(A,B,C)
First, a trivial example. Let's assume that a feature requires 1 unit of resource A, and the ECU has 1 unit each of resources A, B and C available; it is obvious that the feature will fit in the ECU, with resources B and C left over.
Example 2:
Feature2(A,B)
ECU2(A|B,B,C)
In this example, Feature 2 requires resources A and B, and the ECU has 3 resources, the first of which can be either A or B. In this case you can again see that the feature will fit in the ECU, but only if we check in a certain order. If you assign F(A) to E(A|B) and then F(B) to E(B), it works, but if you assign F(B) to E(A|B) then there is no resource left on the ECU for F(A), so it doesn't appear to fit. This leads to the observation that we should prefer non-OR'd resources first to avoid such conflicts.
A real-world example of the above would be an analog input that can also be used as a digital input.
Example 3
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C)
Now things are a little bit more complicated, but it is still quite obvious to a person that the feature will fit into the ECU.
My problems are simply scaled-up versions of these examples (i.e. multiple features per ECU, with more ECUs to choose from).
Algorithms
GA
My first approach to this was to use a genetic algorithm. For a given set of features, i.e. F(A,B,C,D), and a list of currently available ECUs, find which single ECU or combination of ECUs fits the requirements.
ECUs would initially be selected at random, and features would be checked for fit and added to them; if a feature didn't fit, another ECU was added to the architecture. A population of these architectures was created and ranked on the lowest cost of housing all the features. Architectures could then be mated in successive generations, with mutations and so on, to improve fitness.
This approach worked quite well, but it tended to get stuck in local minima (not the cheapest option) when compared against a golden example I had worked out by hand.
Combinatorial / Permutations
My next approach was to work out all of the possible permutations (the ORs from above) for an ECU to see if the features fit.
If we go back to example 2 and expand the ORs, we get 2 permutations:
Feature2(A,B)
ECU2(A|B,B,C) = (A,B,C), (B,B,C)
From here it is trivial to check that the feature fits in the first permutation, but not the second.
...and for example 3 there are 12 permutations:
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C) = (A,B,A), (B,B,A), (C,B,A), (A,C,A), (B,C,A), (C,C,A), (A,B,C), (B,B,C), (C,B,C), (A,C,C), (B,C,C), (C,C,C)
Again it is trivial to check that feature 3 fits in at least one of the permutations (3rd, 5th & 7th).
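For what it's worth, here is a small sketch of this brute-force expansion (Python; the data is just example 3 restated). It expands the OR'd slots with itertools.product and checks the feature's requirements against each concrete assignment as a multiset comparison, which also makes the exponential blow-up easy to see.

```python
from collections import Counter
from itertools import product

def feature_fits(feature_resources, ecu_slots):
    """Expand OR'd ECU slots and check whether the feature's resource
    requirements fit into at least one concrete assignment."""
    needed = Counter(feature_resources)
    # Each slot is a tuple of the alternatives it can provide, e.g. ("A", "B").
    for permutation in product(*ecu_slots):
        available = Counter(permutation)
        if not (needed - available):      # nothing left unmet
            return True
    return False

# Example 3 from the question: Feature3(A,B,C), ECU3(A|B|C, B|C, A|C)
feature3 = ["A", "B", "C"]
ecu3 = [("A", "B", "C"), ("B", "C"), ("A", "C")]
print(feature_fits(feature3, ecu3))   # True, e.g. via (C, B, A)
```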
Based on this approach I was also able to get a solution, but I have ECUs with so many OR'd inputs that I end up with millions of ECU permutations, which drastically increases the run time (to minutes). I can live with this, but I first wanted to see if there was a better way to skin the cat, apart from parallelizing this approach.
So that is the problem...
I have more ideas on how to approach it, but I assume that there is a fancy name for such a problem, or an algorithm that has been around for 20+ years that I'm not familiar with, and I was hoping someone could point me in that direction, either to some papers or to the names of the relevant algorithms.
The obvious remark of simply summing the feature resource requirements and creating a new monolithic ECU is not an option. Lastly, no, this is not in any way associated with any assignment or problem given by a school or university.
Sorry for the long question, but hopefully I've sufficiently described what I am trying to do and this piques the interest of someone out there.
Sincerely, Paul.
It looks like plugging in an individual feature can be solved as bipartite matching.
You make a bipartite graph:
the left side corresponds to feature requirements
the right side corresponds to ECU sub-nodes
edges connect left- and right-side vertices that share a letter
Let me explain with example 2:
Feature2(A,B)
ECU2(A|B,B,C)
Here is how the graph looks:
2 left vertices: L1 (A), L2 (B)
3 right vertices: R1 (A|B), R2 (B), R3 (C)
3 edges: L1-R1 (A-A|B), L2-R1 (B-A|B), L2-R2 (B-B)
Then you find a maximum matching in this bipartite graph. There are a few well-known algorithms for it:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
If the maximum matching covers every feature vertex, we can use it to plug in the feature.
If it does not cover every feature vertex, we are short of resources.
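To make the construction concrete, here is a minimal sketch of a Kuhn-style augmenting-path maximum matching on the graph from example 2, in plain Python with no library assumed; the vertex numbering is just the one used in the explanation above.

```python
def max_bipartite_matching(adj, n_right):
    """adj[i] = list of right-side vertices reachable from left vertex i.
    Returns (size of matching, {left: right} assignment)."""
    match_right = [-1] * n_right   # which left vertex each right vertex is matched to

    def try_augment(left, visited):
        for right in adj[left]:
            if right in visited:
                continue
            visited.add(right)
            # Right vertex is free, or its current partner can be re-matched elsewhere.
            if match_right[right] == -1 or try_augment(match_right[right], visited):
                match_right[right] = left
                return True
        return False

    matched = 0
    for left in range(len(adj)):
        if try_augment(left, set()):
            matched += 1
    assignment = {match_right[r]: r for r in range(n_right) if match_right[r] != -1}
    return matched, assignment

# Example 2: Feature2(A,B) vs ECU2(A|B, B, C)
# Left 0 = requirement A, left 1 = requirement B; right 0 = A|B, right 1 = B, right 2 = C.
adj = [
    [0],      # A can only go to the A|B slot
    [0, 1],   # B can go to the A|B slot or the B slot
]
matched, assignment = max_bipartite_matching(adj, n_right=3)
print(matched == len(adj), assignment)   # True -> the feature fits, e.g. {0: 0, 1: 1}
```

Because the algorithm re-routes earlier assignments along augmenting paths, it finds the A-to-A|B, B-to-B assignment even if B is tried against the A|B slot first, which is exactly the ordering trap described in example 2.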
Unfortunately, this approach behaves like a greedy algorithm: it does not know about upcoming features and does not tweak the solution to fit more features later. Partial optimization for simple cases can work as you described in the question, but in general it's a dead end; only an algorithm that accounts for every feature in the whole feature set can produce an overall effective solution.
You can try to add several features to one ECU simultaneously: if you want to add a new feature to a given ECU, you can try matching all the already-assigned features plus the candidate feature. In this case a locally optimal solution will be found for the given feature set (if it's possible to plug them all into one ECU).
I don't have enough reputation to comment, so here's what I wanted to propose for your problem:
Like GA, there are some other heuristic and learning-based approaches too, e.g. Bayesian approaches, decision trees, etc.
In my opinion a decision tree will suit your problem, as it, given some input dataset/attributes, shows a path to each class (in your case, ECUs) that helps to select the right class/ECU. Train your system with some sample data sets so that it can decide on the right ECU for your actual data set/features.
Check Decision Trees - Machine Learning for more information. Hope it helps!
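For illustration only, here is a rough sketch of what this suggestion could look like with scikit-learn. The encoding of feature requirements as per-resource counts and the ECU labels are entirely invented, and whether such labelled training data exists for the original problem is an open question.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each row counts how many units of resources
# A, B, C a feature needs; the label is the ECU chosen for it in past
# designs (all values invented for this sketch).
X = [
    [1, 0, 0],   # needs 1x A        -> ECU1
    [1, 1, 0],   # needs A and B     -> ECU2
    [1, 1, 1],   # needs A, B and C  -> ECU3
    [0, 1, 1],   # needs B and C     -> ECU3
]
y = ["ECU1", "ECU2", "ECU3", "ECU3"]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1, 1, 0]]))   # e.g. ['ECU2']
```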
How can we determine how many rules and fuzzy sets we need in our fuzzy system?
By increasing the number of rules and fuzzy sets, would the system get better?
How can we determine how many rules and fuzzy sets we actually need for better results?
Thanks
There are many different methods of determining where you need to model fuzziness in a particular application. The overarching principles to keep in mind are: 1) look for places where it would be beneficial to treat ordinal or nominal data on a continuous scale, even at the cost of imprecision, and 2) the "fuzz" should be naturally present in the data or problem you're trying to solve; it's not a secret ingredient one adds to make an application better, as is sometimes implied by overeager enthusiasts. Only add fuzzy rules and sets where you can justify the added computational/data collection/other costs in terms of greater accuracy or some other practical benefit.
With those principles in mind, here are some ways of detecting places where fuzzy rules and sets might be useful:
• The number one candidate is natural language modeling, perhaps through a Behavioral-Driven Development (BDD) process if you're in a software development environment. For example, you can interview people with domain knowledge and look for naturally fuzzy statements, such as "cloudy," "overcast" and "sunny" in meteorology, or fuzzy numbers, like "about half" or "most." Then find membership functions that most accurately match the meaning assigned to those terms. Note that sometimes terms from multiple fuzzy sets can occur together; for example, you grade the truth of the statement "about half of these days were cloudy," which might require three separate membership functions, one for truth, one for the fuzzy number and a third for the "cloudy" category. Linguistic analysis is the simplest way, since people naturally use fuzzy language every day; be aware though that multiple fuzzy sets can actually be combined to model fuzzy logic curiosities that don't often occur in natural language, like "John is taller than he is clever," "Inventory is higher than it is low," "Coffee is at least as unhealthy as it is tasty," and "Her last novel is more political than it is confessional." Those examples come from p. 16 of Bilgic, Taner and Turksen, I.B., "Measurement-Theoretic Justification of Connectives in Fuzzy Set Theory," pp. 289-308 in Fuzzy Sets and Systems, January 1995, Vol. 76, No. 3.
• Another important task is sorting out how to model "linguistic connectives" like fuzzy ANDs and ORs, or crisp conjunctions between fuzzy statements. Some guiding principles have been worked out and are available in such sources as Alsina, C.; Trillas E. and Valverde, L., 1983, “On Some Logical Connectives for Fuzzy Sets Theory," pp. 15-26 in Journal of Mathematical Analysis and Applications. Vol. 93; Dubois, Didier and Prade, Henri, 1985, “A Review of Fuzzy Set Aggregation Connectives," pp. 85-121 in Information Sciences, July-August, 1985. Vol. 36, Nos. 1-2.
• Pooling the opinions of experts (as in an expert system) or the subjective scores of others (as in a movie ratings system). The ratings themselves would constitute one level of fuzziness, while another tier could be added to weight the importance of each expert or other individual's particular score, if they're particularly authoritative.
• Another option is to use neural nets to determine whether or not the addition of various fuzzy rules and sets to your model actually improves accuracy or some other metric related to your end goal.
• Other options include estimating membership functions and the parameters of T-norms and T-conorms (which are often used in fuzzy complements, unions and intersections) with such techniques as regression, Maximum Likelihood Estimation (MLE), Lagrange interpolation, curve fitting and parameter estimation. All of these are discussed in my favorite reference for fuzzy set math: Klir, George J. and Yuan, Bo, 1995, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall: Upper Saddle River, N.J.
• In the same vein, deciding whether or not to include particular fuzzy rules or sets may depend on whether you can find a good fit between their underlying, possibly unknown "actual" membership functions and the ones you're testing. Most of the time triangular, trapezoidal or Gaussian functions will suffice (a short sketch of these standard shapes follows below), but in some situations distribution testing might be necessary to find just the right distribution function. Empirical Distribution Functions (EDFs) might come in handy here.
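As a small illustration of those standard shapes (textbook definitions; the parameter values below are arbitrary examples, not tied to any particular application):

```python
import math

def triangular(x, a, b, c):
    """Rises from 0 at a to 1 at b, then falls back to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """0 outside [a, d], 1 on [b, c], linear on the shoulders."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def gaussian(x, mean, sigma):
    """Bell-shaped membership centred on `mean`."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2))

# e.g. a fuzzy set "about half" on the unit interval
print(triangular(0.45, 0.3, 0.5, 0.7))   # 0.75
```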
To make a long story short, a lot of different statistical and machine learning techniques can be applied to give ballpark answers to these questions. The key is to always stay within the bounds of the two main principles above and only model things with fuzzy sets when it would serve your practical goals, then leave out the rest. I hope that helps.
In data mining, frequent itemsets are found using different algorithms like the Apriori algorithm, FP-Tree, etc. So are these the pattern evaluation methods?
You can try Association Rules (apriori for example), Collaborative Filtering (item-based or user-based) or even Clustering.
I don't know what you are trying to do, but if you have a data-set and you need to find the most frequent item-set you should try some of the above techniques.
If you're using R you should explore the arules package for association rules (for example).
The Apriori algorithm and the FP-tree algorithm are used to find frequent itemsets in given transactional data, which helps in market basket analysis applications. For pattern evaluation there are many measures, namely:
support,
confidence,
lift,
imbalance ratio, etc.
More details can be found in the paper:
Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, "Selecting the Right Interestingness Measure for Association Patterns," KDD 2002.
Refer URL:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.1494&rep=rep1&type=pdf
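As a quick illustration of how these measures are computed for a single rule A => B (the counts are made up, and the imbalance-ratio formula below is the one I recall from the data-mining literature, so double-check it against your reference):

```python
# Made-up counts for a rule A => B over N transactions.
N = 1000          # total transactions
count_A = 300     # transactions containing A
count_B = 250     # transactions containing B
count_AB = 200    # transactions containing both A and B

sup_A = count_A / N
sup_B = count_B / N
sup_AB = count_AB / N

support = sup_AB                    # support(A => B)
confidence = sup_AB / sup_A         # estimate of P(B | A)
lift = confidence / sup_B           # > 1 suggests positive correlation
# Imbalance ratio, as I recall its definition:
imbalance_ratio = abs(sup_A - sup_B) / (sup_A + sup_B - sup_AB)

print(support, confidence, lift, imbalance_ratio)
```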
I'm attempting to write some code for item-based collaborative filtering for product recommendations. The input has buyers as rows and products as columns, with a simple 0/1 flag to indicate whether or not a buyer has bought an item. The output is a list of similar items for a given purchased item, ranked by cosine similarity.
I am attempting to measure the accuracy of a few different implementations, but I am not sure of the best approach. Most of the literature I find mentions using some form of mean square error, but this really seems more applicable when your collaborative filtering algorithm predicts a rating (e.g. 4 out of 5 stars) instead of recommending which items a user will purchase.
One approach I was considering was as follows...
Split the data into training/holdout sets, train on the training data
For each item (A) in the set, select data from the holdout set where users bought A
Determine what percentage of A-buyers bought one of the top 3 recommendations for A-buyers
The above seems kind of arbitrary, but I think it could be useful for comparing two different algorithms when trained on the same data.
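Here is a hedged sketch of that evaluation idea in plain NumPy. The random 0/1 purchase matrix, the buyer-wise holdout split and the top-3 cut-off are all placeholders for your real data and parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder 0/1 purchase matrix: rows = buyers, columns = products.
purchases = (rng.random((200, 20)) < 0.15).astype(float)

# Split buyers into training and holdout sets.
split = 150
train, holdout = purchases[:split], purchases[split:]

# Item-item cosine similarity from the training data.
norms = np.linalg.norm(train, axis=0)
norms[norms == 0] = 1.0                 # avoid division by zero for unsold items
sim = (train.T @ train) / np.outer(norms, norms)
np.fill_diagonal(sim, -1.0)             # never recommend an item to itself

def hit_rate(item, k=3):
    """Share of holdout buyers of `item` who also bought one of its top-k neighbours."""
    top_k = np.argsort(sim[item])[::-1][:k]
    buyers = holdout[holdout[:, item] == 1]
    if len(buyers) == 0:
        return None
    return float(np.mean(buyers[:, top_k].sum(axis=1) > 0))

print([hit_rate(i) for i in range(5)])
```

The per-item hit rate computed here is exactly the "percentage of A-buyers who bought one of the top 3 recommendations" from the list above, so two algorithms trained on the same split can be compared by averaging it over items.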
Actually your approach is quite similar to what the literature does, but I think you should consider using recall and precision, as most of the papers do.
http://en.wikipedia.org/wiki/Precision_and_recall
Moreover, if you use Apache Mahout, there is an implementation of recall and precision in the class GenericRecommenderIRStatsEvaluator.
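Outside Mahout, the two measures boil down to something like this per user (a toy sketch; the item names are invented):

```python
def precision_recall_at_k(recommended, actually_bought, k=3):
    """Toy per-user precision/recall at k for 0/1 purchase data."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(actually_bought))
    precision = hits / k
    recall = hits / len(actually_bought) if actually_bought else 0.0
    return precision, recall

print(precision_recall_at_k(["beer", "cola", "eggs"], {"cola", "bread"}))
# -> (0.333..., 0.5): one of the 3 recommendations was bought,
#    and one of the 2 actual purchases was recommended.
```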
The best way to test a recommender is always to manually verify the results. However, some kind of automatic verification is also good.
In the spirit of a recommendation system, you should split your data in time and see if your algorithm can predict the user's future purchases. This should be done for all users.
Don't expect it to predict everything; 100% correctness is usually a sign of over-fitting.
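A rough sketch of that time-based split, assuming each purchase record carries a timestamp (the log and field order below are invented for the example):

```python
# Hypothetical purchase log: (user_id, item_id, timestamp).
purchases = [
    ("u1", "bread", 1), ("u1", "milk", 5), ("u1", "beer", 9),
    ("u2", "milk", 2), ("u2", "cola", 7),
]

cutoff = 6   # train on everything before the cutoff, test on what comes after
train = [p for p in purchases if p[2] < cutoff]
test = [p for p in purchases if p[2] >= cutoff]

# Train the recommender on `train`, then score how many of each user's
# `test` purchases it actually predicts (e.g. precision/recall at k).
print(len(train), len(test))
```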