I have a site where customers purchase items that are tagged with a variety of taxonomy terms. I want to create a group of customers who might be interested in the same items by considering the tags associated with purchases they've made. Rather than comparing a list of tags for each customer each time I want to build the group, I'm wondering if I can use some type of scoring to solve the problem.
The way I'm thinking about it, each tag would have some unique number assigned to it. When I perform a scoring operation it would render a number that could only be achieved by combining a specific set of tags.
I could update a customer's "score" periodically so that it remains relevant.
Am I on the right track? Any ideas?
Your description of the problem looks much more like a clustering or recommendation problem. I am not sure if those tags are enough of an information to use clustering or recommendation tough.
Your idea of the score doesn't look promising to me, because the same sum could be achieved in several ways, if those numbers aren't carefully enough chosen.
What I would suggest you:
You can store tags for each user. When some user purchases a new item, you will add the tags of the item to the user's tags. On periodical time you will update the users profiles. Let's say we have users A and B. If at the time of the update the similarity between A and B is greater than some threshold, you will add a relation between the users which will indicate that the two users are similar. If it's lower you will remove the relation (if previously they were related). The similarity could be either a number of common tags or num_common_tags / num_of_tags_assigned_either_in_A_or_B.
Later on, when you will want to get users with particular set of tags, you will just do a query which checks which users have that set of tags. Also you can check for similar users to given user, just by looking up which users are linked with the user in question.
If you assign a unique power of two to each tag, then you can sum the values corresponding to the tags, and users with the exact same sets of tags will get identical values.
red = 1
green = 2
blue = 4
yellow = 8
For example, only customers who have the set of { red, blue } will have a value of 5.
This is essentially using a bitmap to represent a set. The drawback is that if you have many tags, you'll quickly run out of integers. For example, if your (unsigned) integer type is four bytes, you'd be limited to 32 tags. There are libraries and classes that let you represent much larger bitsets, but, at that point, it's probably worth considering other approaches.
Another problem with this approach is that it doesn't help you cluster members that are similar but not identical.
Related
Sort of a very long winded explanation of what I'm looking at so I apologize in advance.
Let's consider a Recipe:
Take the bacon and weave it ...blahblahblah...
This recipe has 3 Tags
author (most important) - Chandler Bing
category (medium importance) - Meat recipe (out of meat/vegan/raw/etc categories)
subcategory (lowest importance) - Fast food (our of fast food / haute cuisine etc)
I am a new user that sees a list of randomly sorted recipes (my palate/profile isn't formed yet). I start interacting with different recipes (reading them, saving them, sharing them) and each interaction adds to my profile (each time I read a recipe a point gets added to the respective category/author/subcategory). After a while my profile starts to look something like this :
Chandler Bing - 100 points
Gordon Ramsey - 49 points
Haute cuisine - 12 points
Fast food - 35 points
... and so on
Now, the point of all this exercise is to actually sort the recipe list based on the individual user's preferences. For example in this case I will always see Chandler Bing's recipes on the top (regardless of category), then Ramsey's recipes. At the same time, Bing's recipes will be sorted based on my preferred categories and subcategories, seeing his fast food recipes higher than his haute cuisine ones.
What am I looking at here in terms of a sorting algorithm?
I hope that my question has enough information but if there's anything unclear please let me know and I'll try to add to it.
I would allow the "Tags" with the most importance to have the greatest capacity in point difference. Example: Give author a starting value of 50 points, with a range of 0-100 points. Give Category a starting value of 25 points, with a possible range of 0-50 points, give subcategory a starting value of 12.5 points, with a possible range of 0-25 points. That way, if the user's palate changes over time, s/he will only have to work down from the maximum, or work up from the minimum.
From there, you can simply add up the points for each "Tag", and use one of many languages' sort() methods to compare each recipe.
You can write a comparison function that is used in your sort(). The point is when you're comparing two recipes just add up the points respectively based on their tags and do a simple comparison. That and whatever sorting algorithm you choose should do just fine.
You can use a recursively subdividing MSD (sort of radix sort algorithm). Works as follows:
Take the most significant category of each recipe.
Sort the list of elements based on that category, grouping elements with the same category into one bucket (Ramsay bucket, Bing bucket etc).
Recursively sort each bucket, starting with the next category of importance (Meat bucket etc).
Concatenate the buckets together in order.
Complexity: O(kn) where k is the number of category types and N is the number of recipes.
I think what you're looking for is not a sorting algorithm, but a rating scheme.
You say, you want to sort by preferences. Let's assume, these preferences have different “dimensions”, like level of complexity, type of cuisine, etc.
These dimensions have different levels of measurement. These can be e.g. numeric or simple categories/tags. It would be your job to:
Create a scheme of dimensions and scales that can represent a user's preferences.
Operationalize real-world data to fit into this scheme.
Create a profile for the users which reflects their preferences. Same for the chefs; treat them just like normal users here.
To actually match a user to a chef (or, even to another user), create a sorting callback that matches all your dimensions against each other and makes sure that in each of the dimension the compared users have a similar value (on a numeric scale), or an overlapping set of properties (on a nominal scale, like tags). Then you sort the result by the best match.
I am trying to build an application that has a smart adaptive search engine (lets say for cars). If I search for for 4x4 then the DB will return all the 4x4 cars I have (100 cars) - but as time goes by and I start checking out cars, liking them, commenting on them, etc the order of the search result should be the different. That means 1 month later when searching for 4x4, I should get the same result set ordered differently as per my previous interaction with the site. If I was mainly liking and commenting on German cars, BMW should be on the top and Land cruiser should be further down.
This ranking should be based on attributes that I captureduring user interaction (eg: car origin, user age, user location, car type[4x4, coupe, hatchback], price range). So for each car result I get, I will be weighing it based on how well it is performing on the 5 attributes above.
I intend to use the DB just as a repository and do the ranking and the thinking on the server. My question is, what kind of algorithm should I be using to weigh/rank my search result?
Thanks.
You're basically saying that you already have several ordering schemes:
Keyword search result
amount of likes for car's category
likely others, such as popularity, some form of date, etc.
What you do then is make up a new scheme, call it relevance:
relevance = W1 * keyword_score + W2*likes_score + ...
and sort by relevance. Experiment with the weights W1, W2, ..., until you get something you find useful.
From my understanding search engines work on this principle. It's been long thrown around that Google has on the order of 200 different inputs into the relevance score, PageRank being just one. The beauty of this approach is that it lets you fine tune the importance of everything (even individually for every query), and it lets you add additional inputs without screwing everything up.
I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. HOWEVER, several of the categorical variables have more categories present in one data set over the other. I have been careful enough to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, Green, Blue, and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here, as this is purely about data management in Stata, so I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it, so I will do that now.
What you describe to generate identifiers is not how to think of merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set, generate an identifier that is based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
I think I may have solved my problem - I figured I would post an answer specifically relating to the problem in case anybody has the same issue.
~~
I have two data sets: One containing information about the amount of time IT help spent at a customer and another data set with how much product a customer purchased. Both data sets contain unique ID numbers for each company and the fiscal quarter and year that link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total sum of time spent at a given company regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES not OBSERVATIONS - that is, I wish to horizontally elongate not vertically elongate my final data set.
Stata 12 has many options for merging - one to one, many to one, and one to many. Supposing that I treat my IT time data set as my master and my purchases data set as my merging set, I would perform a "m:1" or many to one merge. This is because I have MANY purchases corresponding to one observation per quarter per company.
I am trying to solve a problem of a dating site. Here is the problem
Each user of app will have some attributes - like the books he reads, movies he watches, music, TV show etc. These are defined top level attribute categories. Each of these categories can have any number of values. e.g. in books : Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with reverse indexing. i.f. Each of Fountain Head, Love Story etc is index key to set of users with that attribute.
When a new user joins, get the attributes of this user, find which index keys for this user, get all the users for these keys, bucket (or radix sort or similar sort) to sort on the basis of how many times a user in this merged list.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating a complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes with every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (like a book, a song etc) to the people who have that in their profile. Then, say you want to find matches of guy g:
for each interest i in g's interests:
for each person p in list of i
if p and g have mismatching sexual preferences
continue
if p is already in g's match list
g->match_list[p].score += i->match_weight
else
add p to g->match_list with score i->match_weight
sort g->match_list based on score
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, an interest in one may totally result in a rejection from the other no matter what other matching interest exists (Take two very similar people that one of them loves and the other hates twilight for example)
I'm writing a web app similar to wtfimages.com in that one visitor should never (or rarely) see the same thing twice, but different visitors can see the same thing. Ideally, this would span visits, so that when Bob comes back tomorrow he doesn't see today's things again either.
Three first guesses:
have enough unique things that it's unlikely any user will draw enough items to repeat
actually track each user somehow and log what he has seen
have client-side Javascript request things by id according to a pseudorandom sequence seeded with something unique to the visitor and session (e.g., IP and time)
Edit: So the question is, which of these three is the best solution? Is there a better one?
Note: I suspect this question is the web 2.0 equivalent of "how do I implement strcpy?", where everybody worth his salt knows K&R's idiomatic while(*s++ = *t++) ; solution. If that's the case, please point me to the web 2.0 K&R, because this specific question is immaterial. I just wanted a a "join the 21st century" project to learn CGI scripting with Python and AJAX with jQuery.
The simplest implementation I can think of would be to make a circular linked list, and then start individual users at random offsets in the linked list. You are guaranteed that they will see every image there is to see before they will see any image twice.
Technically, it only needs to be a linked list in a conceptual sense. For example, you could just use the database identifiers of the various items and wrap around once you've hit the last one.
There are complexity problems with other solutions. For example, if you want it to be a different order for each person, that requires permuting the elements in some way. But then you have to store that permutation, so as to guarantee that people see things in different orders. That's going to take up a lot of space. It will also require you to update everybody's permutations if you add or remove an image to the list of things to see, which is yet more work.
A compromise solution that still allows you to guarantee a person sees every image before they see any image twice while still varying things among people might be something like this:
Using some hash function H (say, MD5), take the hash of each image, and store the image with a filename equal to the digest (e.g. 194db8c5[...].jpg).
Decide on a number N. This will be the number of different paths that a randomly selected person could take to traverse all the images. For example, if you pick N = 10, each person will take one of 10 possible distinct journeys through the images. Don't pick an N larger than the digest size of H (for MD5, this is 16; for SHA-1, it's 64).
Make N different permutations of the image list, with the ith such permutation being generated by rotating the characters in each file name i characters to the left, and then sorting all the entries. (For example, a file originally named abcdef with i == 4 will become efabcd. Now sort all the files that have been transformed in this way, and you have a distinct list.)
Randomly assign to each user a number r from 0 .. N - 1 inclusive. They now see the images in the ordering specified by r.
Ultimately, this seems like a lot of work. I'd say just suck it up and make it random, accept that people will occasionally see the same image again, and move on.
Personally I would just store a cookie on the user's machine which holds all the ID's of what he's seen. That way you can keep the 'randomness' and not have to show the items in sequential order as John Feminella's otherwise great solution suggests.
Applying the cookie data in an SQL query would also be trivial: say that you have a comma separated ID's in the cookie, you can just do this (in PHP):
"SELECT image FROM images WHERE id NOT IN(".$_COOKIE['myData'].") ORDER BY RAND() LIMIT 1"
Note that this is just an simple example, you should of course escape the cookie data properly and there might be more efficient ways to select a random entry from a table.
Using a cookie also makes it possible to start off where the user left off the previous time. And cookie sizes won't probably be an issue, you can hold a lot of ID's in 4KB which is (usually) the maximum size of cookie files.
EDIT
If your cookie data looks like this:
$_COOKIE['myData'] == '1,6,19,200,70,16';
You can safely use that data in a SQL query with:
$ids = array_map('mysql_real_escape_string', explode(',', $_COOKIE['myData']));
$query = "SELECT image FROM images WHERE id NOT IN('".implode("', '", $ids)."') ORDER BY RAND() LIMIT 1"
What this will do is that it splits the ID string into individual ID's, then runs mysql_real_escape_string to each of them, then implodes them with quotes so that the query becomes:
$query == "SELECT image FROM images WHERE id NOT IN('1', '6', '19', '200', '70', '16') ORDER BY RAND() LIMIT 1"
So $_COOKIE[] variables are just like any other variable, and you must do same precautions for them as with other data.
You have 2 class of solutions:
state-less
state-full
You need to pick one: (#1) is of course not guaranteed (i.e. probability of showing same image to user is variable) whilst (#2) allows you guarantees (depending on the implementation of course).
Here is another suggestion you might want to consider:
Maintain state on the Client-Side through HTML5 localstorage (when available): the value of this option will only continue to increase as Web Browsers with HTML5 support increases.