Apriori Algorithm- frequent item set generation

I am using the Apriori algorithm to identify a customer's frequent itemsets. Based on those itemsets, I want to suggest items when the customer adds a new item to the shopping list. The frequent itemsets I got are as follows:
[1],[3],[2],[5]
[2,3],[3,5],[1,3],[2,5]
[2,3,5]
My question is: am I wrong if I consider only the set [2,3,5] when making suggestions? That is, if the customer adds item 3 to the shopping list, I would recommend items 2 and 5. If the customer adds item 1, no suggestions will be made, since I am considering only the set [2,3,5] and item 1 is not in it. I want to know whether this logic (considering only the set [2,3,5]) is enough to make suggestions for the user.

You should derive the rules from how the frequency of an itemset compares to the frequency of its subsets. For example:
If the frequency of (2,3,5) is close to the frequency of (3,5), the rule will be (3,5) -> 2
If the frequency of (2,3,5) is close to the frequency of (3), the rule will be 3 -> (2,5)
If the frequency of (2,3) is close to the frequency of (2), the rule will be 2 -> 3
That means rules can be built not only from the largest frequent itemset but also from its frequent subsets. The rules will be more precise if you take into account how close the frequency of each itemset is to that of its subsets.
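As a rough illustration of the idea above, "close to" can be read as the ratio frequency(itemset) / frequency(subset), which is usually called the confidence of the rule subset -> rest. The sketch below is only one way to do it: the support counts are hypothetical (the question does not give any), and the name rules_by_confidence and the threshold are made up for the example.

```python
from itertools import combinations

def rules_by_confidence(support, min_confidence):
    """Return (antecedent, consequent, confidence) for every rule that passes the threshold."""
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue  # a single item cannot be split into antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                if antecedent not in support:
                    continue  # the subset was not frequent, so its count is unknown
                confidence = count / support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

# Hypothetical support counts for the frequent itemsets listed in the question.
support = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

for antecedent, consequent, confidence in rules_by_confidence(support, 0.9):
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

With these made-up counts, (3,5) -> 2 passes the threshold while 3 -> (2,5) does not, which is the kind of distinction the answer describes.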

No. Deriving recommendation rules requires more effort.
Just because [2,3,5] is frequent does not mean 2 -> 3,5 is a good rule.
Consider the case where 2 is a very popular product, but 3 and 5 are just barely frequent. Consider a gas station. [gas, coffee, bagel] is probably a frequent itemset, but rather few customers who buy gas will also buy coffee and a bagel (low confidence).
You do want to consider rules such as 2,3 -> 5 because they may have higher confidence. I.e. if the customer buys gas and coffee, suggest a bagel.
Frequency is not sufficient for recommendations! Suppose 2 and 3 are bought together in 80% of cases, and 2, 3, 5 in 60% of cases. Naively, 6 out of 8 times the customer will also buy 5, so the rule is 75% correct! But this does not mean 5 is a good recommendation, because 5 could be in 80% of all purchases. In that case a customer who bought 2 and 3 is actually 5 percentage points less likely to buy 5, and we have a negative correlation. That's why you need to look at lift, too, or at other measures like it; there are many.
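The arithmetic in this example can be checked with a few lines of code. This is a small sketch using the usual definitions of confidence and lift; the function names are chosen for the example and the three percentages are the ones from the paragraph above.

```python
def confidence(support_both, support_antecedent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    return support_both / support_antecedent

def lift(support_both, support_antecedent, support_consequent):
    """Confidence divided by the baseline frequency of the consequent.
    Below 1 means the antecedent makes the consequent less likely."""
    return support_both / (support_antecedent * support_consequent)

support_23 = 0.80    # baskets containing 2 and 3
support_235 = 0.60   # baskets containing 2, 3 and 5
support_5 = 0.80     # baskets containing 5

print(confidence(support_235, support_23))       # about 0.75
print(lift(support_235, support_23, support_5))  # about 0.94, i.e. below 1
```

A lift below 1 is the "negative correlation" mentioned above: the rule looks accurate only because item 5 is popular anyway.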

Related

Algorithm development and optimization

I have this problem:
I need to build a meal plan based on data entered by the user. We have a database of meals and their prices (meals are also flagged as breakfast or not, because breakfast is usually different from lunch and dinner). The input is an amount of money (the currency is not important) and a number of days. The output must be a meal plan for the given number of days. Conditions:
The final price does not differ from the given amount by more than 3%.
A meal must not repeat more than once every 5 days.
I found this inefficient solution: compute the average price per day = amount of money / number of days. Then, until we reach the given number of days, iterate through each breakfast, then lunch and dinner (3 for loops, 2 of them nested), and if the day's price is not too different from the average, end the search and add this day to the result list. So the design now looks like this:
while (daysCounter < days) {
    for (/* each breakfast */) {
        for (/* each lunch */) {
            for (/* each dinner */) {
                // if the day's price is close to the average, add the day and stop searching
            }
        }
    }
}
It looks scary, although there is not a lot of data (about 150 meals). I suspect a more efficient solution exists. I also thought about dynamic programming, but so far I have no idea how to implement it.

Dynamic programming won't work because a necessary part of your state is the meals from the last 5 days, and the number of possibilities for that is astronomical.
However, it is likely that there are many solutions, not just a few, that it is easy to find a working solution by being greedy, and that an existing solution can be improved fairly easily.
So I'd solve it like this. Replace days with an array of meals. Take the total you want to spend and divide it up among your meals in proportion to the median price of the options for that meal. (Breakfast is generally cheaper than dinner.) Then add up those per-meal costs to get ideal running totals.
Now, for each meal, choose the option you have not had in the last 5 days that brings the running total of what has been spent as close as possible to the ideal running total. Choose all of your meals this way, one at a time.
This is the greedy approach. Normally, but not always, it will come fairly close to the target.
Then, for a maximum of n tries or until you are within 3% of the target, pick a random meal, consider all options that are not eaten within the previous or next 5 days, and randomly pick one (if any such option exists) that brings the overall amount spent closer to the target.
(For a meal plan that is likely to vary more over a long period, I'd suggest trying simulated annealing. That will produce more interesting results, but it will also take longer.)
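A minimal sketch of the greedy pass described above (without the random improvement step) might look like this. Everything specific in it is an assumption made for illustration: the function name greedy_meal_plan, the menu layout (one price table per meal slot), the made-up prices, and the reading of the repeat rule as "not eaten during the previous 5 days".

```python
from statistics import median

def greedy_meal_plan(options, budget, days, gap=5):
    """options maps slot -> {meal: price}; returns (plan, total_spent)."""
    slots = list(options)
    medians = {s: median(options[s].values()) for s in slots}
    daily_median = sum(medians.values())
    # Ideal spend per slot: the daily budget split in proportion to median prices.
    share = {s: (budget / days) * medians[s] / daily_median for s in slots}

    plan, last_eaten = [], {}
    ideal = spent = 0.0
    for day in range(days):
        for slot in slots:
            ideal += share[slot]
            # Only meals not eaten during the previous `gap` days are allowed.
            allowed = [m for m in options[slot]
                       if m not in last_eaten or day - last_eaten[m] > gap]
            if not allowed:
                raise ValueError(f"not enough distinct meals for {slot}")
            # Greedy step: keep the running total as close to the ideal as possible.
            meal = min(allowed, key=lambda m: abs(spent + options[slot][m] - ideal))
            spent += options[slot][meal]
            last_eaten[meal] = day
            plan.append((day, slot, meal))
    return plan, spent

# Hypothetical menu: 7 options per slot, so 2 are always free with gap=5.
menu = {
    "breakfast": {f"b{i}": 3 + i for i in range(7)},
    "lunch": {f"l{i}": 7 + i for i in range(7)},
    "dinner": {f"d{i}": 11 + i for i in range(7)},
}
plan, spent = greedy_meal_plan(menu, budget=300, days=10)
print(spent, plan[:3])
```

The greedy pass does not guarantee the 3% tolerance by itself; the random replacement step from the answer (or simulated annealing) would run on the returned plan.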

Ranking items amongst each other

I'm making a website much like FaceMash where people rate a duel between two items by how much better one item is than the other. It's essentially just a choice based on some item information that isn't relevant to the algorithm; it's purely based on which item the user believes is the winner.
The user selects one of these options for two random items:
Item 1 much better than item 2. # item 1 win
Item 1 slightly better than item 2. # item 1 win
Item 1 equally good as item 2. # draw
Item 2 slightly better than item 1. # item 2 win
Item 2 much better than item 1. # item 2 win
I would like the grading scale to be dynamic, so that I could add another level such as "item 1 is superior to item 2", which is a greater win than "much better".
I'm not at all a master of algorithms. I think I would have nailed it if it were as simple as "item 1 wins / item 2 loses" or "item 2 wins / item 1 loses", but I really need to grade a win as a big win or a small win. I've looked at ranking algorithms for soccer matches, but their goal is to make predictions, which isn't what I'm after.
The goal is to create a ranking amongst ALL items in my set.

AFAIK, implementing something similar to the Elo rating system is the most common approach for maintaining the kind of rating you want.
The difference from naively updating the ratings with constant steps is that the Elo system takes the current difference between the ratings into account when calculating how they should be updated.
That is, if A already has a much higher rating than B and a human votes for A, the rating of A is increased by only a tiny bit (and the rating of B is decreased by only a tiny bit): since A was already known to be much better than B, a vote for A was expected. On the other hand, if a human votes for B, the changes to both ratings are much larger, because the vote contradicts the current ratings, so they need to be adjusted more.
And if the ratings for A and B are similar, the ratings are adjusted only moderately, according to the human's decision.
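A minimal sketch of such an update, using the standard Elo formulas. The graded outcomes are not part of the answer above; mapping each vote to a score between 0 and 1 (the GRADES table below) is just one possible way to express big versus small wins, and the values, the names, and K = 32 are assumptions for the example.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation: the score A is predicted to get against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after one vote; score_a is in [0, 1]."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Hypothetical grade scale: add or change levels freely, as long as values stay in [0, 1].
GRADES = {
    "item 1 much better": 1.0,
    "item 1 slightly better": 0.75,
    "equal": 0.5,
    "item 2 slightly better": 0.25,
    "item 2 much better": 0.0,
}

print(elo_update(1500, 1500, GRADES["item 1 much better"]))      # (1516.0, 1484.0)
print(elo_update(1500, 1500, GRADES["item 1 slightly better"]))  # (1508.0, 1492.0)
```

Sorting all items by their current rating then gives the overall ranking.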

Finding most distinct elements in a set

Say we have a perfume shop that has 100 different perfumes.
Let's say 10,000 customers come in and rate each perfume one through five stars.
Let's say the question is: "how to best construct a pack of 5 perfumes so that 95% customers will give a 4+ star rating for at least one of them"
How to do this algorithmically?
NOTE: I can see that even the question isn't properly formed; there's no guarantee that such a construction even exists. There is a trade-off between 2 parameters.
NOTE: Also (and this makes the perfume analogy slightly artificial), it doesn't matter whether we get one good match or three good matches. So {4.3, 0, 0, 0, 0} would be equivalent to {4.3, 4.2, 4.2, 4.2, 4.2} -- in both cases the score is 4.3.
Let's say for the purpose of argument that perfumes 0-19 are sweet, perfumes 20-39 are sour, etc. (similarly for salty, bitter, umami).
So there would be very high cross-correlation between perfumes 0-19.
If you modelled this with 100 points in space, then 0-19 would all attract each other very strongly, they would form a cluster.
Similarly you would get 4 other clusters for the other four tastes.
So from just one metric, we have separated out 5 distinct flavours.
But does this technique extend?
PS just giving the names of related techniques would be very helpful, as this would allow me to Google for further information. So any answer that just restates the question in industry accepted terminology would be useful!

This algorithm should find a solution to the problem:
Order the perfumes by the number of customers giving a 4+ rating.
Choose the first perfume in the list that has not been considered yet.
Delete the ratings from the customers who are now satisfied.
Repeat the process for perfumes 2 - 5 in the pack.
Backtrack when necessary to obtain a selection satisfying the criterion.

The true problem is NP-hard, but you can make use of a greedy algorithm:
1. Let C be the set of all your customers.
2. Assign to each perfume a coverage, given by the number of customers in C who gave that perfume 4+.
3. Sort by descending coverage and choose the perfume with the highest coverage. If C is empty and all coverages are zero, choose a perfume at random (actually, if C is nonempty but smaller than 5% of the original, your requirement is already met).
4. Remove from C all customers (not ratings) satisfied by the perfume just chosen.
5. Repeat from 2 unless you already have 5 perfumes.
This automatically takes care of taste clustering: a customer giving high marks to sweet perfumes will be satisfied by the most voted sweet perfume, and he will then be struck out from C, all his further ratings ignored, and the algorithm will proceed to satisfy other customers.
Also, you should notice that even if you can't satisfy the requisite (95%, 4+) with five perfumes, perfume similarity will ensure that this algorithm maximizes both the coverage and the marks - so you might end up with, say, (93%, 3.9).
Also, suppose that 10% of users do not give any marks above 3. There's no way that you can 4-satisfy 95% of customers, since 10% of total are at most 3-satisfiable. You might want to build C with customers that actually did give at least one 4+ rating.
Or you could change the algorithm and instead of the one in your question, decide on using a knapsack: you want to take home the highest cumulative rating. This also raises the likelihood of a customer being satisfied by the overall package (as is, he is almost guaranteed to very much like one perfume, but he might strongly dislike the other four).
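The greedy steps in this answer can be sketched roughly as follows. This is the usual greedy heuristic for what is generally called the maximum coverage problem, which may help with searching for further material. The function name greedy_pack, the data layout (customer -> {perfume: stars}) and the tiny ratings table are made up for the example.

```python
def greedy_pack(ratings, pack_size=5, threshold=4):
    """ratings maps customer -> {perfume: stars}; returns (pack, fraction_satisfied)."""
    unsatisfied = set(ratings)  # the set C of customers not yet covered
    perfumes = sorted({p for stars in ratings.values() for p in stars})
    pack = []
    while len(pack) < min(pack_size, len(perfumes)):
        # Coverage: how many still-unsatisfied customers rate this perfume at or above the threshold.
        def coverage(perfume):
            return sum(1 for c in unsatisfied if ratings[c].get(perfume, 0) >= threshold)
        best = max((p for p in perfumes if p not in pack), key=coverage)
        pack.append(best)
        # Strike out the customers (not just their ratings) that this perfume satisfies.
        unsatisfied = {c for c in unsatisfied if ratings[c].get(best, 0) < threshold}
    return pack, 1 - len(unsatisfied) / len(ratings)

# Hypothetical ratings: eve never gives 4+, so she can never be satisfied.
ratings = {
    "ann": {"rose": 5, "musk": 2, "citrus": 1},
    "bob": {"rose": 4, "musk": 3, "citrus": 2},
    "cat": {"rose": 1, "musk": 5, "citrus": 4},
    "dan": {"rose": 2, "musk": 1, "citrus": 5},
    "eve": {"rose": 3, "musk": 3, "citrus": 3},
}
pack, satisfied = greedy_pack(ratings, pack_size=2)
print(pack, satisfied)
```

The sketch does not stop early once 95% is reached and breaks ties by perfume name rather than at random; both are simplifications.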

Algorithm needed - benelux contest 2007

This question (the last one) appeared in the Benelux Algorithm Programming Contest 2007:
http://www.cs.duke.edu/courses/cps149s/spring08/problems/bapc07/allprobs.pdf
Problem statement in short:
A company needs to figure out a strategy for when to buy, sell, or do nothing on a given input so as to maximise profit. The input has the form:
6
4 4 2
2 9 3
....
....
This means the input covers 6 days.
Day 1: You get 4 shares, each priced at $4, and you can sell at most 2 of them
Day 2: You get 2 shares, each priced at $9, and you can sell at most 3 of them
.
We need to output the maximum profit which can be achieved.
I'm thinking about how to approach this problem. It seems to me that brute force will take too much time. Can this be converted to some DP problem like 0-1 knapsack? Any help will be highly appreciated.

It can be solved by DP.
Suppose there are n days and the total number of shares is m.
Let f[i][j] be the maximum profit at day i with j shares remaining.
Then f[i][j] = maximum(f[i-1][j+k] + k*price_per_day[i]), for 0 <= k <= maximum_shares_sell_per_day[i].
It can be optimized further: since f[i][...] depends only on f[i-1][...], a rolling array can be used, so you only need to define f[2][m] to save space.
The total time complexity is O(n*m*maximum_shares_sell_per_day).
Perhaps it can be optimized further to save time. Any feedback is welcome.

Your description does not quite match the last problem in the PDF. In the PDF you receive the number of shares specified in the first column (or are forced to buy them; since there is no decision to make, it does not matter) and can only decide how many shares to sell. Since it does not say otherwise, I presume that short selling is not allowed (else ignore everything except the price and go make so much money on the derivatives market that you can afford to both bribe the SEC or Congress and retire :-)).
This looks like a dynamic program, where the state at each point in time is the total number of shares you have in hand. So at time n you have an array with one element for each possible number of shares you might have ended up with at that time, and in that element you have the maximum amount of money you can make up to then while ending up with that number of shares. From this you can work out the same information for time n+1. When you reach the end, then all your shares are worthless so the best answer is the one associated with the maximum amount of money.
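A minimal sketch of that dynamic program, assuming the reading given above: each input line is (shares received, price per share, maximum shares sellable that day), shares can only be sold after they are received, and unsold shares are worth nothing at the end. The function name max_profit is made up, and only the two days actually quoted in the question are used as sample data.

```python
def max_profit(days):
    """days: list of (received, price, sell_limit) tuples, one per day."""
    NEG = float("-inf")
    # best[j] = most money made so far while holding exactly j shares.
    best = [0]
    for received, price, limit in days:
        nxt = [NEG] * (len(best) + received)
        for held, money in enumerate(best):
            if money == NEG:
                continue  # this holding is unreachable
            have = held + received
            for sold in range(min(limit, have) + 1):
                remaining = have - sold
                nxt[remaining] = max(nxt[remaining], money + sold * price)
        best = nxt  # rolling array: only the previous day is needed
    return max(best)

# First two days from the question: (4 shares, $4, sell at most 2), (2 shares, $9, sell at most 3).
print(max_profit([(4, 4, 2), (2, 9, 3)]))  # 35: sell 2 at $4, then 3 at $9
```

The running time is proportional to days * total shares * daily sell limit, in line with the complexity quoted in the other DP answer.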

We can't do better than selling the maximum amount of shares we can on the day with the highest price, so I was thinking: (this may be somewhat difficult to implement (efficiently))
It may be a good idea to calculate the total number of shares received so far for each day to improve the efficiency of the algorithm.
Process the days in decreasing order of price.
For a day, sell amount = min(daily sell limit, shares available) (for the max price day (the first processed day), shares available = shares received to date).
For all subsequent days, shares available -= sell amount. For preceding days, we binary search for (shares available - shares sold) and all entries between that and the day just processed = 0.
We might not need to physically set the values (at least not at every step), just calculate them on-the-fly from the history thus-far (I'm thinking interval tree or something similar).
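A plain quadratic sketch of this idea, without the binary search or interval tree, could look like the following. It assumes the same input reading as the rest of the thread, (received, price, sell_limit) per day, and tracks for every day how many received shares are still uncommitted; the name greedy_profit and the slack array are inventions for the example.

```python
def greedy_profit(days):
    """days: list of (received, price, sell_limit) tuples, one per day."""
    # slack[t] = shares received up to day t minus shares already committed for sale by day t.
    slack, total = [], 0
    for received, _, _ in days:
        total += received
        slack.append(total)

    profit = 0
    # Process the days in decreasing order of price.
    for d in sorted(range(len(days)), key=lambda i: days[i][1], reverse=True):
        _, price, limit = days[d]
        # Selling on day d uses up availability on day d and on every later day.
        amount = min(limit, min(slack[d:]))
        for t in range(d, len(days)):
            slack[t] -= amount
        profit += amount * price
    return profit

print(greedy_profit([(4, 4, 2), (2, 9, 3)]))  # 35
```

On the two days quoted in the question this gives 35, the same value the dynamic-programming formulation produces for that input.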

How to calculate correlation amongst preferences?

I have to split a group of x people into 3 or 4 groups, most likely 3.
I want people to be happy, so I'm having each person rate the other members of the big group from 1 to (x-1).
How do I optimize preferences to create 3 groups?

Here is a method that is likely to get a good arrangement, even if it is not an optimal arrangement:
First create a ranking function that can take any pair of groupings and determine whether one is better than the other. Then apply the following algorithm:
1. Randomly assign people into groups.
2. Randomly pick one person from each group.
3. Create new groupings in which each combination of reassignments is performed on the people chosen in step 2. (For 3 groups there will be 6 such reassignments. For 4, 24.)
4. Of all possible reassignments, pick the best one.
5. Repeat steps 2–4 one million times.
UPDATE
If there are only 18 people that need to be assigned, then that's just (18 choose 6) * (12 choose 6) / 6 = 2,858,856 possible groupings. (Or, in the case of four groups it's (18 choose 4) * (14 choose 4) * (10 choose 5) / 4 = 192,972,780 groupings.)
You can just try each one and pick the best.
I guess the ranking algorithm itself is really the hard part of this assignment.
You could just give each person a score based on summing the scores of the people selected to be in their group, then sum the scores of each person together.
The problem is that you're going to end up with all the popular people in one group, and all the unpopular people in another group, and all the telephone handset cleaners in another group.
You should just assign people randomly, and then tell them that you used some really scientific system. That way everybody gets a good mix.

Measure the total satisfaction of a given configuration by calculating the distance between the actual positions and the stated preferences. Start with a randomized set of groups. Then use something like hill climbing or simulated annealing to optimise.
http://en.wikipedia.org/wiki/Hill_climbing
http://en.wikipedia.org/wiki/Simulated_annealing
Simulated annealing sounds complicated, but it's really just a cleverer version of hill-climbing.
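A minimal hill-climbing sketch for this kind of grouping, under stated assumptions: a higher rating means a stronger preference, the score of a configuration is the sum of the ratings people gave to their own groupmates, and a move swaps two people between two groups. The names group_score and hill_climb and the six-person data set are made up for the example.

```python
import random

def group_score(groups, rating):
    """Sum of the ratings each person gave to the other members of their group."""
    return sum(rating[a][b] for g in groups for a in g for b in g if a != b)

def hill_climb(people, rating, n_groups=3, iterations=2000, seed=0):
    rng = random.Random(seed)
    order = list(people)
    rng.shuffle(order)
    groups = [order[i::n_groups] for i in range(n_groups)]  # random, near-equal groups
    best = group_score(groups, rating)
    for _ in range(iterations):
        g1, g2 = rng.sample(range(n_groups), 2)
        i, j = rng.randrange(len(groups[g1])), rng.randrange(len(groups[g2]))
        # Try swapping one person from each of the two groups.
        groups[g1][i], groups[g2][j] = groups[g2][j], groups[g1][i]
        score = group_score(groups, rating)
        if score >= best:
            best = score  # keep the swap
        else:
            groups[g1][i], groups[g2][j] = groups[g2][j], groups[g1][i]  # undo it
    return groups, best

# Hypothetical preferences: three pairs who rate each other 5 and everyone else 1.
people = list("abcdef")
partner = {"a": "b", "b": "a", "c": "d", "d": "c", "e": "f", "f": "e"}
rating = {p: {q: (5 if partner[p] == q else 1) for q in people if q != p} for p in people}

groups, best = hill_climb(people, rating)
print(groups, best)
```

Simulated annealing would differ only in the acceptance rule: a worse swap is sometimes kept, with a probability that shrinks as the search goes on.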
