Algorithm for scoring user activity

I have an application where users can:
Write reviews about products
Add comments to products
Up / Down vote reviews
Up / Down vote comments
Every Up/Down vote is recorded in a db table.
What I want to do now is create a ranking of the most active users in the last 4 weeks.
Of course, good reviews should be weighted more heavily than good comments, but, for example, 10 good comments should also count for more than just one good review.
Example:
// reviews created in recent 4 weeks
//format: [ upVoteCount, downVoteCount ]
var reviews = [ [120,23], [32,12], [12,0], [23,45] ];
// comments created in recent 4 weeks
// format: [ upVoteCount, downVoteCount ]
var comments = [ [1,2], [322,1], [0,0], [0,45] ];
// create weight vector
// format: [ reviewWeight, commentsWeight ]
var weight = [0.60, 0.40];
// signature: activities..., activityWeight
var userActivityScore = score(reviews, comments, weight);
... update user table ...
List<Users> users = "from users u order by u.userActivityScore desc";
What would a fair scoring function look like?
What could an implementation of the score() function look like? How can a weight g be added to the function so that reviews are weighted more heavily? And what would such a function look like if, for example, votes for pictures were added?

These kinds of algorithms can be quite complex. You might want to take a look at these books:
Collective Intelligence in Action, Satnam Alag, Manning, 2008, ISBN 1933988312
Algorithms of the Intelligent Web, Haralambos Marmanis, Manning, 2009, ISBN 1933988665
Programming Collective Intelligence: Building Smart Web 2.0 Applications, Toby Segaran, O'Reilly Media, 2007, ISBN 0596529325
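As a very rough starting point (and not taken from those books), here is a minimal sketch in Python of what a score() function along the lines of the question could look like. It assumes every review or comment simply contributes its weighted net votes (upvotes minus downvotes), so enough good comments can outweigh a single good review, and a new activity type such as pictures becomes just one more list plus one more weight:

# Illustrative sketch only; a "fair" production scoring function would likely
# need damping or normalization (e.g. for vote volume) that is not shown here.
def score(activities, weights):
    # activities: one list per activity type, each a list of [upVotes, downVotes]
    # weights: one weight per activity type, e.g. [0.6, 0.4]
    total = 0.0
    for vote_pairs, weight in zip(activities, weights):
        for up, down in vote_pairs:
            total += weight * (up - down)  # each item adds its weighted net votes
    return total

reviews = [[120, 23], [32, 12], [12, 0], [23, 45]]
comments = [[1, 2], [322, 1], [0, 0], [0, 45]]

# reviews weighted 0.6, comments 0.4; pictures would just be a third list and a third weight
userActivityScore = score([reviews, comments], [0.6, 0.4])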

Power Query to Convert List of Links to Grid of Crossings

In Excel I have a data table of Paired Items that are tagged with an identifier. Essentially, named linkages.
Worksheet: Links
Tag        | Point-A | Point-B
Route 1    | Home    | Office
Route 2    | Home    | Grocery 1
Happy Hour | Office  | Bar
Sad Hour   | Office  | Dump
Headaches  | Bar     | Pharmacy
Sick       | Bar     | Dump
Route 3    | Office  | Moms
Route 4    | Office  | Park
Victory    | Park    | Bar
Discard    | Park    | Dump
I want to transform this data into a grid of all points in rows and columns with the tag placed at the intersection (Much like old paper road maps with grids for city distances)
Worksheet: Grid
A \ B     | Bar        | Dump     | Grocery 1 | Home    | Moms    | Office     | Park    | Pharmacy
Bar       |            | Sick     |           |         |         | Happy Hour | Victory | Headaches
Dump      | Sick       |          |           |         |         | Sad Hour   | Discard |
Grocery 1 |            |          |           | Route 2 |         |            |         |
Home      |            |          | Route 2   |         |         | Route 1    |         |
Moms      |            |          |           |         |         | Route 3    |         |
Office    | Happy Hour | Sad Hour |           | Route 1 | Route 3 |            | Route 4 |
Park      | Victory    | Discard  |           |         |         | Route 4    |         |
Pharmacy  | Headaches  |          |           |         |         |            |         |
I have written the following M code for transforming, but it seems a bit wayward and overwrought. I am using bit coding of points to construct a join key, so the bitting process will probably break around 32 points.
Is there a shorter set of LETs that does the same transform to a grid?
Is there a way to create a key that is a delimited concatenation of Min(Point-A, Point-B) with Max(Point-A, Point-B), and thus not rely on bitting?
M code (copied from Advanced Editor)
let
    LinksTable = Table.SelectRows(Excel.CurrentWorkbook(), each [Name] = "Links"),
    Links = Table.RemoveColumns(Table.ExpandTableColumn(LinksTable, "Content", {"Tag", "Point-A", "Point-B"}), "Name"),
    AllPoints = Table.Combine(
        { Table.SelectColumns(Table.RenameColumns(Links, {"Point-A", "Point"}), "Point"),
          Table.SelectColumns(Table.RenameColumns(Links, {"Point-B", "Point"}), "Point")
        }),
    ThePoints = Table.Sort(Table.Distinct(AllPoints), {"Point"}),
    PointsIndexed = Table.AddIndexColumn(ThePoints, "Index", 0, 1, Int64.Type),
    PointsBitted = Table.RemoveColumns(Table.AddColumn(PointsIndexed, "Bit", each Number.Power(2, [Index]), Int64.Type), "Index"),
    AllPairsBitted = Table.Join(
        Table.RenameColumns(PointsBitted, {{"Point", "Point-A"}, {"Bit", "Bit-A"}}), {},
        Table.RenameColumns(PointsBitted, {{"Point", "Point-B"}, {"Bit", "Bit-B"}}), {},
        JoinKind.FullOuter
    ),
    AllPairsKeyed = Table.RemoveColumns(
        Table.AddColumn(AllPairsBitted, "BitKeyPair", each Number.BitwiseOr([#"Bit-A"], [#"Bit-B"])),
        {"Bit-A", "Bit-B"}
    ),
    #"Links-A-Bitted" = Table.Join(
        Links, "Point-A",
        Table.RenameColumns(PointsBitted, {{"Point", "Point-A"}, {"Bit", "Bit-A"}}), "Point-A"
    ),
    #"Links-AB-Bitted" = Table.Join(
        #"Links-A-Bitted", "Point-B",
        Table.RenameColumns(PointsBitted, {{"Point", "Point-B"}, {"Bit", "Bit-B"}}), "Point-B"
    ),
    LinksKeyed = Table.RemoveColumns(
        Table.AddColumn(#"Links-AB-Bitted", "BitKeyLink", each Number.BitwiseOr([#"Bit-A"], [#"Bit-B"])),
        {"Bit-A", "Bit-B"}
    ),
    AllPairsTagged = Table.Sort(
        Table.RemoveColumns(
            Table.Join(
                AllPairsKeyed, "BitKeyPair",
                Table.SelectColumns(LinksKeyed, {"BitKeyLink", "Tag"}), "BitKeyLink",
                JoinKind.LeftOuter
            ),
            {"BitKeyPair", "BitKeyLink"}
        ),
        {"Point-A", "Point-B"}
    ),
    Grid = Table.Pivot(AllPairsTagged, List.Distinct(AllPairsTagged[#"Point-B"]), "Point-B", "Tag", List.First)
in
    Grid
I think you can use Pivot to achieve this. Using this functionality directly would not work, because you are looking for symmetry between columns and rows.
The trick is to force that symmetry by appending the Point-B values to the Point-A values.
Steps
1. Create a secondary table and reorder the columns in the opposite order from the original table: Tag, Point-B and Point-A.
2. On the secondary table, rename the columns to Tag, Point-A and Point-B, in that order. Append matches column names literally, so without the rename the values would simply land back under the same columns.
3. Pivot on column Point-B without aggregating data.
4. Reorder the columns using Point-A as a reference, so that columns and rows are symmetric.
It's worth mentioning that it's good practice to buffer the source table, because it is used multiple times across the calculation.
Calculation
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("Zc69CsMgEMDxVwnOLv14ghKoS2looEvIcJwXIkEMZxx8+6axiUInwd/99bpOvFxYqDoJKZSztB7PYTBIope7nbPd2SFxXMe/rGCeY6Vc4JxJcQPetAX9Z3Wwc0oJNOBI/hdI0YzAFjCm1uB0yBGldS7lgw9nfWHX0hrgabO3wcVx3K/yirXxCKwzpK/6Dw==", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Tag = _t, #"Point-A" = _t, #"Point-B" = _t]),
BufferedSource = Table.Buffer(Source),
SecondTable = Table.ReorderColumns(BufferedSource,{"Tag","Point-B","Point-A"}),
SecondTableRenameCols = Table.RenameColumns(SecondTable,{{"Point-A","Point-B"},{"Point-B","Point-A"}}),
AppendTables = Table.Combine({BufferedSource,SecondTableRenameCols}),
PivotTables = Table.Pivot(AppendTables, List.Distinct(AppendTables[#"Point-B"]), "Point-B", "Tag"),
ReorderCols = Table.ReorderColumns( PivotTables, PivotTables[#"Point-A"])
in ReorderCols
Output
Point-A   | Bar        | Dump     | Grocery 1 | Home    | Moms    | Office     | Park    | Pharmacy
Bar       |            | Sick     |           |         |         | Happy Hour | Victory | Headaches
Dump      | Sick       |          |           |         |         | Sad Hour   | Discard |
Grocery 1 |            |          |           | Route 2 |         |            |         |
Home      |            |          | Route 2   |         |         | Route 1    |         |
Moms      |            |          |           |         |         | Route 3    |         |
Office    | Happy Hour | Sad Hour |           | Route 1 | Route 3 |            | Route 4 |
Park      | Victory    | Discard  |           |         |         | Route 4    |         |
Pharmacy  | Headaches  |          |           |         |         |            |         |

How do I retrieve data from Oracle from multiple tables and columns without duplicates?

I am trying to retrieve distinct data from 3 different tables.
My query looks like this:
SELECT T.Topic,T.EventNo, T.EventType, T.EventLoc, T.EventDate, T.StartTime, T.EndTime, T.Details, ((ES.SFirstName || ' ' || ES.SLastName))AS SPEAKER
FROM TIMETABLE T
, EXTERNALSPEAKER ES
, SPEAKEREVENT SE
WHERE T.EventNo = SE.EventNo
AND ES.SpeakerID = SE.SpeakerID
AND EventDate >= SYSDATE
ORDER BY EventDate;
The result looks like this:
Normalization by Evaluation for Sized Dependent Types 4 Lecture CH.03.024, FLOOR 1
Normalization by Evaluation for Sized Dependent Types 4 Lecture CH.03.024, FLOOR 1
Careers and Employment Information Workshop 1 Workshop Park Plaza Westminster Bridge Hotel
Object-Oriented Software Design 2 Lecture CH.02.054, FLOOR 3
Doing for our robots what evolution did for us 3 Lecture CH.01.044, FLOOR 4
Doing for our robots what evolution did for us 3 Lecture CH.01.044, FLOOR 4
I have spent hours and I just can't figure it out. I am new to SQL.
Thank you!
We've got too little information to suggest something smarter, so here's the most obvious "solution": DISTINCT.
select DISTINCT T.Topic, ...

Is MNL the right model to use when the choice options vary across observations?

In a survey of 100 people, I am asking each person to choose between product A and product B. I ask each person this question 3 times, but each time I present a different set of products. Say, the first time, Person 1 is asked to choose between 'Phone 1' and 'Phone 2', given certain attributes of each phone. The second time the choice is again 'Phone 1' vs. 'Phone 2', but with a different set of attributes for each phone.
Each time the question is asked, the person is presented with three attributes for each of the two phone alternatives. So, each time, the attributes of the phones, such as cost, memory and camera pixels, are shown so that the user can choose which set of attributes is most attractive, Phone 1's or Phone 2's.
Overall, 3 * 100 = 300 responses; 3 responses per person. Each time, the attributes cost, memory and camera pixels are presented and the user is asked to choose the feature set they prefer.
My goal is to analyze how users value the features of a phone vs. the cost of the phone.
In this scenario, can I use MNL, even though each time I asked the person a question I only presented two choices? My understanding is that MNL is used when (a) there are multiple choices and (b) the choice options do not change across observations, i.e. each person is asked to choose between multiple products, say A, B, C, and A, B, C do not change across observations.
In the scenario described above, the two choices varied across the three times the same person was asked the question. If not MNL, should I rather create a binary logit model, given that the user only had to choose between two options each time the question was asked (even though they were asked the question three times)? If I can use binary logit, should I be concerned that the choice set of products changes across observations? Or should I let the attributes defined in each of the rows account for the differences in product choices across observations?
I have set up the data as follows (thinking I can do MNL, but maybe I should set it up differently and use another modeling approach?):
I am working on designing and analyzing a similar survey, but mine is related to transportation. I am at the beginner level and still new to the whole concept; however, I will give you some advice and a reference that may be helpful.
First point: I have come across 3 models, as follows, from a useful video on YouTube:
MNL refers to the Multinomial Logit Model. MNL is used with alternative-invariant regressors (for example, the salary of the participant in the survey, or his/her gender …).
The conditional logit model is used with alternative-invariant (gender, salary, education level …) and alternative-variant regressors (cost of the product, memory, camera pixels …).
The mixed logit model uses random parameters. It is also used with alternative-invariant (gender, salary, education level …) and alternative-variant regressors (cost of the product, memory, camera pixels …).
Note regarding alternative-invariant and alternative-variant regressors:
The gender of the person participating in the survey will NOT vary between Product A and Product B, so it is an alternative-invariant regressor, while the price of the product can vary between Product A and Product B, so it is called an alternative-variant regressor.
Based on the above, I assume you need to use a conditional logit model or a mixed logit model.
I couldn't find a dedicated function in R for the conditional logit model or the mixed logit model. The same mlogit function is used; refer to the examples below from the help of the mlogit package:
a pure "multinomial model"
summary(mlogit(mode ~ 0 | income, data = Fish))
a pure "conditional" model
summary(mlogit(mode ~ price + catch, data = Fish))
a "mixed" model
m <- mlogit(mode ~ price+ catch | income, data = Fish)
summary(m)
same model with charter as the reference level
m <- mlogit(mode ~ price+ catch | income, data = Fish, reflevel = "charter")
From the examples above, I think (but am NOT sure) that in the manual of the mlogit package they refer to a mixed model when you use both alternative-invariant and alternative-variant regressors, to a conditional model when you have only alternative-variant regressors, and to a multinomial model when you have only alternative-invariant regressors.
Second point: There is something called "panel data" when you are asking the same person to choose one product for each choice set. "Same person" here means that in your model you are taking into consideration the gender, the salary, the education level …, which stay the same for the same person. Check this: https://en.wikipedia.org/wiki/Panel_data
To use panel techniques, please refer to the help of the mlogit package in R. I am quoting the following from it:
“panel only relevant if rpar is not NULL and if the data are repeated observations of the same unit ; if TRUE, the mixed-logit model is estimated using panel techniques”
So, in my understanding, if you want to use the panel techniques you have to use random draws, because panel will be TRUE and rpar will not be NULL.
Moreover, for an example of using panel data, please refer to the example below from "Estimation of multinomial logit models in R: The mlogit Packages" by Yves Croissant:
data("Train", package = "mlogit")
Tr <- mlogit.data(Train, shape = "wide", varying = 4:11, choice = "choice", sep = "_", opposite = c("price", "time", "change", "comfort"), alt.levels=c("A", "B"), id.var ="id")
Train.ml <- mlogit(choice ~ price + time + change + comfort, Tr)
Train.mxlc <- mlogit(choice ~ price + time + change + comfort, Tr, panel = TRUE, rpar = c(time = "cn", change = "n", comfort = "ln"), correlation = TRUE, R = 100, halton = NA)
Train.mxlu <- update(Train.mxlc, correlation = FALSE)
I hope that is useful to you.

Linq to Entities: complex query getting "average" restaurant rating

So I'm building a Restaurant Review site for my community. I need to extract data from the following tables: RESTAURANT, CUISINE, CITY, PRICE and RATING (customer ratings).
The query should return all restaurants of a selected CUISINE_ID and return the RESTAURANT_NAME, CUISINE_NAME, CITY_NAME, PRICE_CODE, and it should average all the reviews' RATING_CODE values and return a calculated value. I'm fine with returning all the data except the average rating.
I've only been working with LINQ to Entities for 2 days and LINQ for about 3 weeks, so I'm really a newbie; I'm waiting for my LINQ book to be delivered from Amazon.com. Your help and guidance would be appreciated!
It should end up looking something like this:
var avgForMatches =
(from r in context.Restaurants
where r.Cuisines.Any(c => c.CuisineName == cuisineName)
where r.Prices.Any(p => p.PriceCode == priceCode)
//... same pattern for other searches.
select r.RatingCode)
.Average();
Read about aggregate methods (including Average) in the 101 LINQ Samples: http://msdn.microsoft.com/en-us/vcsharp/aa336747

Best clustering algorithm? (simply explained)

Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();

function check($str1, $str2) {
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    } else {
        return FALSE;
    }
}

$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
    $square = pow($number, 2);
    return $square;
}

function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words

    // document 1: word counts for text 1
    $document1 = array();
    foreach ($words as $word) {
        if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1 + square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end

    // document 2: counts of text 2's words that also occur in text 1
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2 + square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end

    // dot product over the shared vocabulary (match counts by word, not by array position)
    $skalarprodukt = 0;
    foreach ($document1 as $word => $count) {
        if (isset($document2[$word])) {
            $skalarprodukt = $skalarprodukt + ($count * $document2[$word]);
        }
    }

    if (($rating1 * $rating2) == 0) { return FALSE; } // empty vector: cannot be related

    $kosinusmass = $skalarprodukt / ($rating1 * $rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    } else {
        return TRUE;
    }
}
?>
I would also like to say that I know there are lots of clustering algorithms, but on every site there is only the mathematical description, which is a bit difficult for me to understand. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like yours is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say that across all your articles you only have 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:
for word in article.text.split():
    article.histogram[indexLookup[word]] += 1
Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:
def check(articleA, articleB):
    rtn = 0
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a * b
    return rtn > threshold
(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
Also, you should not just put every word into your histogram. In general, you'll want to include the ones that are used semi-frequently: not in every article, but not in only one article either. This saves you a bit of overhead on your histogram and increases the value of your relations.
By the way, this technique is described in more detail here
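One way to address the normalization point above (a sketch only, not part of the original answer; the cosine_check name and the 0.5 threshold are made up) is to divide the dot product by the two vector lengths, i.e. use cosine similarity, which removes most of the dependence on article length:

import math
from collections import Counter

def cosine_check(text_a, text_b, threshold=0.5):
    # Build word-count histograms and compare them by cosine similarity.
    hist_a = Counter(text_a.lower().split())
    hist_b = Counter(text_b.lower().split())
    dot = sum(count * hist_b[word] for word, count in hist_a.items())
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return False
    return dot / (norm_a * norm_b) > threshold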
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a Lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K*number of documents per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.
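For illustration only, here is a minimal sketch of that K-means loop over fixed-length document vectors (plain Python, no libraries; the kmeans name, the fixed iteration count and the squared-Euclidean distance are arbitrary choices, and a real implementation would add a convergence check and smarter initialization):

import random

def kmeans(vectors, k, iterations=10):
    # vectors: equal-length numeric lists, e.g. the word histograms described above
    centers = random.sample(vectors, k)              # pick K documents as initial centers
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:                            # assign each document to the closest center
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            clusters[distances.index(min(distances))].append(v)
        for i, members in enumerate(clusters):       # recompute centers as the mean of their members
            if members:
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters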
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but it seems too arbitrary to catch all useful clustering of texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform clustering to identify the nearest neighbors for each article, which you could load into the related table. I would do the clustering in R - there's a nice tutorial on clustering data in a file, running R from PHP.
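As a rough sketch of that workflow (Python purely for illustration; write_similarity_matrix, the CSV layout and the choice of similarity function are all placeholders, and with 20,000 articles this is exactly the 20K x 20K pass mentioned above):

import csv

def write_similarity_matrix(texts, similarity, path="similarity.csv"):
    # texts: list of article strings; similarity: any function returning a number,
    # e.g. a check() variant that returns $prozent / rtn instead of true/false.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for a in texts:
            writer.writerow([similarity(a, b) for b in texts])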
