Apriori Algorithm - algorithm

I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it. Can anyone explain the workings of this algorithm to me in a simple way? Also, a basic example would make it a lot easier for me to understand.

Apriori Algorithm
It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.
Apriori Pruning Principle - If any itemset is infrequent, then its supersets should not be generated/tested.
Apriori Property - A given (k+1)-itemset is a candidate (k+1)-itemset only if every one of its k-itemset subsets is frequent.
Now, here is the apriori algorithm in 4 steps.
Initially, scan the database/dataset once to get the frequent 1-itemset.
Generate length k+1 candidate itemsets from length k frequent itemsets.
Test the candidates against the database/dataset.
Terminate when no frequent or candidate set can be generated.
Solved Example
Suppose there is a transaction database as follows, with 4 transactions, each listed with its transaction ID and the items bought in it. Assume the minimum support, min_sup, is 2. Here the support of an itemset is the number of transactions that contain it.
Transaction DB
tid | items
-------------
10 | A,C,D
20 | B,C,E
30 | A,B,C,E
40 | B,E
Now, let's create the candidate 1-itemsets with the 1st scan of the DB. This set is called C_1 and looks as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3
If we test this against min_sup, we can see that {D} does not satisfy the min_sup of 2. So it will not be included in the frequent 1-itemsets, which we call L_1, as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{E} | 3
Now, let's generate the candidate 2-itemsets from L_1 and scan the DB for the 2nd time to count their support. We call this set C_2, as follows.
itemset | sup
-------------
{A,B} | 1
{A,C} | 2
{A,E} | 1
{B,C} | 2
{B,E} | 3
{C,E} | 2
As you can see, the {A,B} and {A,E} itemsets do not satisfy the min_sup of 2, and hence they will not be included in the frequent 2-itemsets, L_2, as follows.
itemset | sup
-------------
{A,C} | 2
{B,C} | 2
{B,E} | 3
{C,E} | 2
Now let's do a 3rd scan of the DB and get the support of the candidate 3-itemsets, C_3, as follows.
itemset | sup
-------------
{A,B,C} | 1
{A,B,E} | 1
{A,C,E} | 1
{B,C,E} | 2
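You can see that {A,B,C}, {A,B,E} and {A,C,E} do not satisfy the min_sup of 2. (Strictly speaking, the Apriori property already rules these three out as candidates before the scan, since each of them contains the infrequent 2-itemset {A,B} or {A,E}; they are listed here only to show their support.) So they will not be included in the frequent 3-itemsets, L_3, as follows.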
itemset | sup
-------------
{B,C,E} | 2
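The walk-through above can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions, not an optimized implementation: the function name `apriori` and the representation of transactions as sets are choices made here, and support is counted as a number of transactions, as in the example.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Step 1: frequent 1-itemsets.
    singles = {frozenset([item]) for t in transactions for item in t}
    level = {c: s for c, s in count(singles).items() if s >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # Step 2: join frequent k-itemsets into (k+1)-item candidates ...
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... and keep only those whose k-item subsets are all frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        # Step 3: test the surviving candidates against the database.
        level = {c: s for c, s in count(candidates).items() if s >= min_sup}
        frequent.update(level)
        k += 1
    # Step 4: the loop ends when no frequent itemset of the current size exists.
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(db, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

In this sketch a candidate that contains an infrequent 2-itemset, such as {A,B,C}, is dropped by the prune step before any counting takes place.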
Now, finally, we can calculate the support (supp), confidence (conf) and lift (interestingness value) of the Association/Correlation Rules that can be generated from the itemset {B,C,E}. Here supp is the fraction of transactions containing all items of the rule, conf is supp divided by the support of the left-hand side, and lift is conf divided by the support of the right-hand side.
Rule | supp | conf | lift
-------------------------------------------
B -> C & E | 50% | 66.67% | 1.33
E -> B & C | 50% | 66.67% | 1.33
C -> E & B | 50% | 66.67% | 0.89
B & C -> E | 50% | 100% | 1.33
E & B -> C | 50% | 66.67% | 0.89
C & E -> B | 50% | 100% | 1.33
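The three measures can be computed directly from the transaction database. The sketch below assumes the usual definitions (supp of a rule X -> Y is the fraction of transactions containing X and Y together, conf is that value divided by supp(X), and lift is conf divided by supp(Y)); the helper names `support` and `rule_metrics` are hypothetical.

```python
TRANSACTIONS = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in TRANSACTIONS if itemset <= t) / len(TRANSACTIONS)

def rule_metrics(antecedent, consequent):
    """Return (supp, conf, lift) for the rule antecedent -> consequent."""
    supp = support(set(antecedent) | set(consequent))
    conf = supp / support(antecedent)
    lift = conf / support(consequent)
    return supp, conf, lift

# Items are single letters here, so a string such as "CE" stands for {C, E}.
for lhs, rhs in [("B", "CE"), ("BC", "E")]:
    supp, conf, lift = rule_metrics(lhs, rhs)
    print(f"{' & '.join(lhs)} -> {' & '.join(rhs)}: "
          f"supp={supp:.0%} conf={conf:.2%} lift={lift:.2f}")
```

For B -> C & E this gives 50%, 66.67% and 1.33, the values listed in the first row of the table.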

See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.

Well, I would assume you've read the Wikipedia entry, but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that, so I'll assume you haven't read it and suggest that you do.
Read the Wikipedia article.

The best introduction to Apriori can be found in this book:
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
You can download chapter 6 for free; it explains Apriori very clearly.
Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
http://www.philippe-fournier-viger.com/spmf/

Related

What is the point of choosing closest node in Dijkstra algorithm?

In all the articles I have read, the neighbor processed first is the "closest" one. But in the end all nodes have to be visited to figure out all possible paths. So the question is: why do we do this? I believe the same result can be achieved if we simply traverse the graph in BFS order and calculate the costs as we go. For example:
first step- 0, costs table:
2 - 6 |
3 - 2 |
second step- 2, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
third step- 3, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
4 - 12 |
fourth step- 1, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
4 - 12 |
5 - 12 |
fifth step- 4, costs table:
2 - 6 |
3 - 2 |
1 - 9 |
4 - 12 |
5 - 12 |
With a simple BFS traversal the cheapest way was found. What am I missing?
Suppose the path from A to B and B to C are both cost 1, and the direct route from A to C is cost 3. (In the real world, the first two are highways that bypass a mountain while the third is a tiny trail over a mountain pass.)
Dijkstra will route you A -> B -> C for a total cost of 2, while a plain BFS, which settles on the route with the fewest edges, will route you A -> C for a total cost of 3.
Therefore you have to process lowest cost first to get the right answer.
At each step, Dijkstra's algorithm extends the lowest-cost path known so far. Thus, when you finally encounter the goal state, you know that all other, unfinished paths have a greater cost. Therefore, the one you just found is the shortest path.
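The three-node example from this answer can be checked with a small sketch. The graph literal and the function names `bfs_cost` and `dijkstra_cost` are assumptions made for illustration; `bfs_cost` models a plain BFS that keeps the first path it discovers to each node.

```python
import heapq
from collections import deque

# A-B and B-C cost 1 each, the direct A-C route costs 3.
GRAPH = {
    "A": {"B": 1, "C": 3},
    "B": {"A": 1, "C": 1},
    "C": {"A": 3, "B": 1},
}

def bfs_cost(graph, source, target):
    """Cost of the first path a plain BFS discovers (fewest edges)."""
    cost = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor, weight in graph[node].items():
            if neighbor not in cost:  # first discovery wins, never revised
                cost[neighbor] = cost[node] + weight
                queue.append(neighbor)
    return cost.get(target)

def dijkstra_cost(graph, source, target):
    """Cost of the cheapest path, always extending the lowest-cost node first."""
    heap = [(0, source)]
    done = set()
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if node in done:
            continue
        done.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in done:
                heapq.heappush(heap, (dist + weight, neighbor))
    return None

print(bfs_cost(GRAPH, "A", "C"), dijkstra_cost(GRAPH, "A", "C"))
```

The priority queue is what guarantees that a node's cost is final when it is taken out: every path still waiting in the queue is at least as expensive.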

Verify whether the following answer is correct?

I am asked to write a grammar which generates the following language over the alphabet Z={a,b}
M={w | number of b's in w is 3 modulo 4}
My Answer is
S->bP| bJ | aS
P->bQ | bK | aP
Q->bR | bL | aQ
R->bS | e | aS
L->e
will this work?
Can we make it better?
Not sure what J, K and L are. But yes, you can probably do better; a DFA with 4 states can recognize your language, so there's definitely a regular grammar with four nonterminals:
S -> aS | bR
R -> aR | bT
T -> aT | bU | b
U -> aU | bS | a
This works because states S, R, T and U correspond to having seen 0, 1, 2 and 3 instances of b, modulo 4. Seeing an a leaves you in whatever state you were in before, while seeing a b takes you to the next state. Because state U is accepting, we add U -> e and then remove the empty production in the usual way, which is where the extra productions T -> b and U -> a come from.
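One way to gain confidence in the four-nonterminal grammar is to compare it with the language definition on all short words. The sketch below is only a bounded check, not a proof; the `GRAMMAR` encoding and the function name `generates` are assumptions, and the maximum word length of 8 is arbitrary.

```python
from itertools import product

# Right-linear grammar from the answer: each production is either a terminal
# followed by a nonterminal, or a single terminal.
GRAMMAR = {
    "S": ["aS", "bR"],
    "R": ["aR", "bT"],
    "T": ["aT", "bU", "b"],
    "U": ["aU", "bS", "a"],
}

def generates(word, start="S"):
    """True if the grammar derives `word` (simulated like an NFA)."""
    current = {start}
    for pos, symbol in enumerate(word):
        is_last = pos == len(word) - 1
        following = set()
        for nonterminal in current:
            for rhs in GRAMMAR[nonterminal]:
                if rhs[0] != symbol:
                    continue
                if len(rhs) == 2:
                    following.add(rhs[1])  # derivation continues
                elif is_last:
                    return True            # terminal-only production ends it
        current = following
    return False

mismatches = [
    "".join(w)
    for n in range(9)
    for w in product("ab", repeat=n)
    if generates("".join(w)) != ("".join(w).count("b") % 4 == 3)
]
print(mismatches)
```

An empty list of mismatches means the grammar and the definition of M agree on every word of length 0 to 8.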

Normalize 5-star rating to make rating more uniform

I have a system where people rate different items on a scale from 0-5. The issue is, not everyone rates the same items, and the scoring is not objective. The goal is to achieve a fair comparison between items so that an item's score is not affected too much if one of the scorers is very "lenient" or "harsh." In actuality, there may be 100 items, each one scored twice, but here is an example dataset where 4 people scored 12 items, each one being scored twice:
Item | Score 1 (rater) | Score 2 (rater)
-------------------------------------
1 | 5 (A) | 4 (C)
2 | 5 (A) | 3 (C)
3 | 4 (A) | 2 (C)
4 | 5 (A) | 5 (D)
5 | 3 (A) | 0 (D)
6 | 5 (A) | 3 (D)
7 | 3 (B) | 1 (D)
8 | 4 (B) | 1 (D)
9 | 4 (B) | 2 (D)
10 | 4 (B) | 3 (C)
11 | 4 (B) | 3 (C)
12 | 5 (B) | 4 (C)
In this table, the letter after each score identifies the person who gave it: person A gave score 1 to items 1-6, person B gave score 1 to items 7-12, person C gave score 2 to items 1-3 and 10-12, and person D gave score 2 to items 4-9.
Informally, if we assume person C was the closest to each item's objective score, we might reason as follows:
Person A generally gave higher scores than C on items 1-3 so he is "lenient."
D gave low scores to all of them except for item 4 which then must have been truly good. He gave scores generally lower than A, so his scores should be adjusted slightly upwards perhaps.
B gave higher scores than D, and a bit higher than C, so a bit "lenient".
Thus, we might produce adjusted scores for each item. For example, even though item 2 has a higher average score than item 9, they are probably on par considering A is generally lenient and D is generally harsh. The question is, how do we do this programmatically? I thought we might make several transformation functions which transform a raw score into an adjusted score, say A, B, C, and D. For example, we might have A(5)=3.7 because when A rates an item as 5, it is really in the 3-4 range. Then, we want to minimize
|A(x_0a)-C(x_0c)|^2 + |D(x_1d)-A(x_1a)|^2 + |B(x_2b)-D(x_2d)|^2 + |C(x_3c)-B(x_3b)|^2
where x_ip is a vector which consists of person p's ratings for items 3i+1, 3i+2, and 3i+3. We might make A, B, C, and D linear transformations, for example. How then do you optimize it? And is this the best way to eliminate the harshness or leniency of scorers without throwing away their ratings?
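As a starting point, here is a sketch of a deliberately simplified version of this idea, not the full linear transformation asked about: each rater p only gets an additive offset b_p, so that A(x) = x - b_A, and each item gets a quality q_i. The least-squares fit of score = q_i + b_p is found by alternating averages. The offsets are pinned to sum to zero, because without some anchoring the objective above has degenerate minimizers (for instance, mapping every score to the same constant). The function name `fit_offsets` and the iteration count are assumptions.

```python
SCORE_1 = [5, 5, 4, 5, 3, 5, 3, 4, 4, 4, 4, 5]   # raters A (items 1-6), B (7-12)
SCORE_2 = [4, 3, 2, 5, 0, 3, 1, 1, 2, 3, 3, 4]   # raters C (1-3, 10-12), D (4-9)

RATINGS = []  # (item, rater, score)
for item in range(1, 13):
    RATINGS.append((item, "A" if item <= 6 else "B", SCORE_1[item - 1]))
    RATINGS.append((item, "D" if 4 <= item <= 9 else "C", SCORE_2[item - 1]))

def fit_offsets(ratings, iterations=100):
    """Fit score = quality[item] + bias[rater] by alternating averages."""
    items = sorted({item for item, _, _ in ratings})
    raters = sorted({rater for _, rater, _ in ratings})
    bias = {rater: 0.0 for rater in raters}
    quality = {}
    for _ in range(iterations):
        # Item quality: mean of its scores after removing each rater's offset.
        for item in items:
            vals = [s - bias[r] for i, r, s in ratings if i == item]
            quality[item] = sum(vals) / len(vals)
        # Rater offset: mean amount by which the rater exceeds item quality.
        for rater in raters:
            vals = [s - quality[i] for i, r, s in ratings if r == rater]
            bias[rater] = sum(vals) / len(vals)
        # Anchor: offsets sum to zero so the split is unique.
        mean_bias = sum(bias.values()) / len(bias)
        for rater in raters:
            bias[rater] -= mean_bias
    return quality, bias

quality, bias = fit_offsets(RATINGS)
for rater in sorted(bias):
    print(rater, round(bias[rater], 2))
```

On the example data this model assigns positive offsets to A and B (lenient) and negative offsets to C and D, with D the harshest. Whether a pure offset is enough, or a per-rater scale is also needed, depends on the real data.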

Minimum cost within limited time for a timetable?

I have a timetable like this:
+-----------+-------------+------------+------------+------------+------------+-------+----+
| transport | trainnumber | departcity | arrivecity | departtime | arrivetime | price | id |
+-----------+-------------+------------+------------+------------+------------+-------+----+
| Q | Q00 | BJ | TJ | 13:00:00 | 15:00:00 | 10 | 1 |
| Q | Q01 | BJ | TJ | 18:00:00 | 20:00:00 | 10 | 2 |
| Q | Q02 | TJ | BJ | 16:00:00 | 18:00:00 | 10 | 3 |
| Q | Q03 | TJ | BJ | 21:00:00 | 23:00:00 | 10 | 4 |
| Q | Q04 | HA | DL | 06:00:00 | 11:00:00 | 50 | 5 |
| Q | Q05 | HA | DL | 14:00:00 | 19:00:00 | 50 | 6 |
| Q | Q06 | HA | DL | 18:00:00 | 23:00:00 | 50 | 7 |
| Q | Q07 | DL | HA | 07:00:00 | 12:00:00 | 50 | 8 |
| Q | Q08 | DL | HA | 15:00:00 | 20:00:00 | 50 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ...|
+-----------+-------------+------------+------------+------------+------------+-------+----+
In this table, there are 13 cities and 116 routes altogether, and the smallest unit of time is half an hour.
There are different kinds of transport, which doesn't matter. As you can see, there can be multiple edges with the same departcity and arrivecity but different times and different prices. The timetable is the same every day.
Now, here arises a problem.
A user wonders how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D... (whether they must be visited in order depends on what the user wants, so there are really two problems), within X hours and at the least cost under the above conditions.
Before this problem, I have solved another simpler problem.
A user wonders how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D... (in order or not, as above), at the least cost under the above conditions.
Here is how I solved it (taking the unordered case as an example):
1. Sort the must-pass cities: C1, C2, C3...Cn. Let C0 = A, C(n+1) = B, minCost.cost = INFINITE;
2. i = 0, j = 1, W = {};
3. Find a least-cost way S from Ci to Cj using Dijkstra's algorithm with price as the weight of edges. W = W∪S;
4. i = i + 1, j = j + 1;
5. If j <= n + 1, goto 3;
6. If W.cost < minCost.cost, minCost = W;
7. If a next permutation of C1...Cn exists, rearrange the list C1...Cn into that permutation and goto 2;
8. Return minCost;
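The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the poster's actual code: the graph is a plain dict of cheapest prices between cities (departure times are ignored, as in the steps), the city names in the usage example are hypothetical, and Dijkstra is run once per starting city rather than once per leg of every permutation.

```python
import heapq
from itertools import permutations

def dijkstra(graph, source):
    """Cheapest price from `source` to every reachable city."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, city = heapq.heappop(heap)
        if d > dist.get(city, float("inf")):
            continue  # stale queue entry
        for nxt, price in graph.get(city, {}).items():
            nd = d + price
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist

def cheapest_via(graph, start, goal, must_pass):
    """Try every order of the must-pass cities and keep the cheapest total."""
    dist_from = {city: dijkstra(graph, city) for city in {start, *must_pass}}
    best = None
    for order in permutations(must_pass):
        stops = [start, *order, goal]
        legs = [dist_from[a].get(b) for a, b in zip(stops, stops[1:])]
        if None in legs:
            continue  # some leg is unreachable in this order
        total = sum(legs)
        if best is None or total < best[0]:
            best = (total, stops)
    return best

# Hypothetical price graph, for illustration only.
graph = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 5},
    "C": {"B": 1, "D": 1},
    "D": {},
}
print(cheapest_via(graph, "A", "D", ["C", "B"]))
```

The number of permutations grows factorially with the number of must-pass cities, so this only stays practical for a handful of them.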
However, I cannot come up with an efficient solution to the first problem. Please help me, thanks.
I would also appreciate it if anyone could solve another problem:
A user wonders how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D... (in order or not, as above), in the least time under the above conditions.
It's quite a big problem, so I will just sketch a solution.
First, remodel your graph as follows. Instead of each vertex representing a city, let a vertex represent a tuple of (city, time). This is feasible as there are only 13 cities and only (time_limit - current_time) * 2 possible time slots as the smallest unit of time is half an hour. Now connect vertices according to the given timetable with prices as their weights as before. Don't forget that the user can stay at any city for any amount of time for free. All nodes with city A are start nodes, all nodes with city B are target nodes. Take the minimum value of all (B, time) vertices to get the solution with least cost. If there are multiple, take the one with the smallest time.
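The (city, time) remodelling can be sketched as below. Several things are assumptions made for the sketch: time is an integer count of half-hour slots, the timetable repeats every 48 slots, each route is a tuple (depart city, arrive city, departure slot within the day, duration in slots, price), and the function name `cheapest_trip` is hypothetical. Must-pass cities are not handled here.

```python
import heapq

SLOTS_PER_DAY = 48  # half-hour slots; the timetable repeats daily

# The four BJ/TJ rows of the timetable, converted to slots.
ROUTES = [
    ("BJ", "TJ", 26, 4, 10),  # Q00 13:00 -> 15:00
    ("BJ", "TJ", 36, 4, 10),  # Q01 18:00 -> 20:00
    ("TJ", "BJ", 32, 4, 10),  # Q02 16:00 -> 18:00
    ("TJ", "BJ", 42, 4, 10),  # Q03 21:00 -> 23:00
]

def cheapest_trip(routes, start, goal, start_slot, max_slots):
    """Dijkstra over (city, slot) states; returns (cost, arrival slot) or None."""
    deadline = start_slot + max_slots
    heap = [(0, start_slot, start)]  # ordered by cost, then by time
    done = set()
    while heap:
        cost, slot, city = heapq.heappop(heap)
        if city == goal:
            return cost, slot
        if (city, slot) in done:
            continue
        done.add((city, slot))
        # Waiting in a city for one more slot is free.
        if slot + 1 <= deadline:
            heapq.heappush(heap, (cost, slot + 1, city))
        # Board any connection leaving this city in this slot.
        for dep_city, arr_city, dep_slot, duration, price in routes:
            if (dep_city == city and slot % SLOTS_PER_DAY == dep_slot
                    and slot + duration <= deadline):
                heapq.heappush(heap, (cost + price, slot + duration, arr_city))
    return None

# Leave BJ at 12:00 (slot 24) and reach TJ within 4 hours (8 slots).
print(cheapest_trip(ROUTES, "BJ", "TJ", 24, 8))
```

Because the queue is ordered by cost first and time second, the first target state taken from it is a cheapest one and, among equally cheap ones, the earliest.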
Now on towards forcing the user to pass through certain cities in order. If there are n cities to pass through (plus start and target city), you need n+2 copies of the same graph which act as different levels. The level represents how many cities of your list you have already passed. So you start in level 0 on vertex A. Once you get to C1 in level 0 you move to the vertex C1 in level 1 of the graph (connect the vertices by 0-weight edges). This means that when you are in level k, you have already passed cities C1 to Ck and you can get to the next level only by going through C(k+1). The vertices of city B in the last level are your target nodes.
Note: I said copies of the same graph, but that is not exactly true. You can't allow the user to reach C(k+2), ..., B in level k, that would violate the required order.
To enforce passing cities in any order, a different scheme of connecting the levels (and modifying them during runtime) is required. I'll leave this to you.

Tic Tac Toe heuristic AI [duplicate]

This question already has answers here:
What algorithm for a tic-tac-toe game can I use to determine the "best move" for the AI?
(10 answers)
Closed 7 years ago.
I designed a simple AI for a 3x3 Tic Tac Toe game. However, I didn't want to do either a complete search or MinMax. Instead I thought of a heuristic that would evaluate values for all 9 fields, and then the AI would choose the field with the highest value. The problem is, I have absolutely no idea how to determine whether it is a perfect (unbeatable) algorithm.
And here are the details:
Every Field has several WinPaths on the grid. Middle one has 4 (horizontal, vertical and two diagonals), corners have 3 each (horizontal, diagonal and one vertical), sides have only 2 each (horizontal and vertical). Value of each Field equals sum of its WinPaths values. And WinPath value depends on its contents:
Empty: [ | | ] - 1 point
One symbol: [X| | ] - 10 points // can be any symbol in any place
Two different symbols: [X|O| ] - 0 points // they can be arranged in any possible way
Two identical opponent's symbols: [X|X| ] - 100 points // arranged in any of three ways
Two identical "my" symbols: [O|O| ] - 1000 points // arranged in any of three ways
This way for example beginning situation has values as below:
3 | 2 | 3
---+---+---
2 | 4 | 2
---+---+---
3 | 2 | 3
However a later one can be like this (X is moving now):
X | 10| O
---+---+---
O | O |110
---+---+---
X | 20| 20
So is there any reliable way to find out whether this is a perfect algorithm, or whether it has any disadvantages?
PS. I was trying (from the perspective of a player) to create a fork situation so I could beat this AI, but I failed.
Wikipedia: tic-tac-toe says that there are only 362,880 possible tic-tac-toe games. A brute force approach to proving your algorithm would be to exhaustively search the game tree, having your opponent try each possible move at each turn, and see if your algorithm ever loses (it's guaranteed a win or draw if perfect). The space is small enough that a program could do this very quickly. Of course, you would then be faced with proving that your test program is correct.
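The exhaustive check suggested here can be sketched as follows, using the scoring rule from the question. Two things are assumptions: the board is a 9-character string read row by row, and ties between equally valued fields are broken by taking the first one in reading order (the question does not say how ties are resolved, and the result may depend on it). All function names are hypothetical.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def path_value(cells, me):
    """Value of one WinPath that passes through an empty field."""
    marks = [c for c in cells if c != " "]
    if not marks:
        return 1
    if len(marks) == 1:
        return 10
    if marks[0] != marks[1]:
        return 0
    return 1000 if marks[0] == me else 100

def field_values(board, me):
    """Map each empty field to the sum of its WinPath values."""
    return {
        i: sum(path_value([board[j] for j in line], me)
               for line in LINES if i in line)
        for i in range(9) if board[i] == " "
    }

def heuristic_move(board, me):
    values = field_values(board, me)
    return max(sorted(values), key=values.get)  # tie: lowest index (assumption)

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def can_lose(board, to_move, ai):
    """True if some sequence of opponent moves beats the heuristic player."""
    won = winner(board)
    if won is not None:
        return won != ai
    if " " not in board:
        return False  # draw
    if to_move == ai:
        moves = [heuristic_move(board, ai)]
    else:
        moves = [i for i in range(9) if board[i] == " "]
    other = "O" if to_move == "X" else "X"
    return any(can_lose(board[:i] + to_move + board[i + 1:], other, ai)
               for i in moves)

print(field_values(" " * 9, "X"))
print("X can lose moving first: ", can_lose(" " * 9, "X", "X"))
print("X can lose moving second:", can_lose(" " * 9, "O", "X"))
```

The first line printed reproduces the opening values 3/2/3, 2/4/2, 3/2/3 from the question, which is a quick sanity check that the scoring rule was encoded as described.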
To know whether your bot is good enough, you have to play a lot of games: bot vs. top players and bot vs. the best bots on the market (this is usually done for much more complicated games like chess or Go).
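Did you try this (I play first, as O)? With the scoring rule from the question, the values for X are then:
12| 2 |12
---+---+---
2 |13 |11
---+---+---
12|11 | O
Right? So X takes the center, the field with the highest value:
  |   |
---+---+---
  | X |
---+---+---
  |   | O
Then I take the opposite corner:
O |20 |30
---+---+---
20| X |20
---+---+---
30|20 | O
If I understand it well, X's next move will be in a corner:
O |20 | X
---+---+---
20| X |20
---+---+---
30|20 | O
And from here I win: I take the bottom-left corner to block X's diagonal, which gives me two winning threats at once (left column and bottom row).
If your AI passes this (maybe I missed something), then your solution looks perfect.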

Resources