Grouping web site users by routes they make - algorithm

I’m writing a simple website. I want to be able to group users by routes they make on my sites. For example I have this site tree, but the final product will be more complicated. Lets say I have three users.
User one route is A->B->C->D
User two route is A->B->C->E
User three route is J->K
We can say that users one and two belongs to same group and user three belongs to some other group.
My question is: what algorithm or maybe more than one I should use to accomplish that> Also what data do I need to collect?
I have some ideas, however, I want to confront it with someone who might have more experience than me.
I’m looking for suggestions rather than an exact solution for my problem. Also, if there are any ready-made solution which I can read, I will appreciate it also.

You could also consider common subpaths. A pointer in this direction and general web usage analysis is: http://pdf.aminer.org/000/473/298/improving_the_effectiveness_of_a_web_site_with_web_usage.pdf

As a first cut, it seems reasonable to divide the problem into (1) defining a similarity score for two traces and (2) using the similarity scores to cluster. One possibility for (1) is Levenshtein distance. Two possibilities for (2) are farthest-point clustering and k-modes.

Related

Fuzzy identity fingerprinting

I have a spreadsheet with values like address, name, IBAN, e-mail and want to identify when a customer last time bought something.
The problem is: some fields contain spelling mistakes, others were deliberately entered wrong.
On GitHub, several libraries like https://github.com/seatgeek/fuzzywuzzy, https://github.com/seamusabshere/fuzzy_match or https://github.com/atom/fuzzaldrin are available to perform fuzzy searches based on a single and comparable column. But I want to combine multiple fields - this sounds like a common problem and I expected to find existing solutions out there.
Can you recommend approaches for such a problem? Are there existing projects for such a problem which I am missing?
Is a regular string-distance over all the fields usually good enough?
I mentioned it in your other question, but the dedupe python library does what you want.
Basically, it calculates the distance between each field in a pair of rows, then learns optimal weights to combine those distances into a single record-pair score.
So far I believe http://blog.yhat.com/posts/fuzzy-matching-with-yhat.html and using fuzzyWuzzy seems to be the best approach.

"Who to follow" algorithm

I want to give users the ability to view some personalized users they might find interesting and might follow them...
I was thinking of it like that:
- Get all users he is currently following
- Get all followers that they follow
- rank them by total posts they made (DESC), filled up personal information fields
- show 5 of them on each page load
in case user has followers then an information message will appear...
Can this kind of feature be done with this algorithm or is there a better or even easier way to do it?
In your algorithm, I'm wondering why you need to sort users based on number of posts, maybe it has something to do with reputation?
Recommendation is indeed a very large, open topic, and is also a hot academic research fields. If we are working on a practical project, I think it will be nice to to stay simple and focused.
I witnessed the following two kinds of recommendations on a very popular
social website. From my experience, the recommendation output is of high quality. Here I'm brainstorming the algorithms behind. Hope it helps.
Discover persons you might know: Recommend person whose 'following set' intersects with your 'following set'. It is based on the "clustering effect" of social network: The friend of your friend is more likely to be your friend.
Recommend person based on interests: If the users could be celebrities, companies, institutions, press media, etc., then recommendations like the following might be useful: "People following #Linus also follow #Stallman, #LinuxDeveloper, ...". Suppose you've just followed #Linus, to recommend #Stallman, #LinuxDeveloper, first we need to find out all users following #Linus, then figure out their common following list, possibly ranked by number of followers. The idea is to recommend users based on interest correlations. We calculate and discover high correlation users, assuming that users' following list are grouped by their interests.
(I'm also thinking, algorithm 1 will discover persons that share common interests with you, if users could be celebrities, etc.. This might be preferred for some scenarios.)
You're asking a very open-ended question here - how to pick a small number of recommendations out of a large set. So the answer is - you can make it as simple or as complicated as you want it to be! The simplest would be to pick a few at random (and any more complex algorithm had better prove that it produces better results than that.) Your solution of gathering all users who are two hops away, and then ranking by number of posts, is just a bit more complex, and then at the other extreme are the sophisticated algorithms used by the Amazons and Googles of the world. Companies put a lot of effort into building this sort of thing - have you heard of the Netflix Prize?
as I understand you want to follow the user that could offer high quality information about your Thema .we need an Algorithm to give this user as result to us ,but how can I find these users:
The users that have many Followers are a good choice but not always many of users in Twitter follow another users only as respect or ethiquet.
The users that his/her twitts retwitt many times with other user is a good choice
and the user that they are mentioned many times by other users.
I think ,to find theses users we should use Link based Analyse such as HITS or Page rank algorithim
You may want to consider not including people that are following the given user. I imagine might not be so interested in you, and this could potentially be problematic. However, you maybe very interested in finding more about the people that is following.
Are you considering showing the user the reason why these people were recommended to them? For example, saying like you may be interested in what little billy is saying because of his connection to your wife. If so, to potentially avoid angered users, it may be worth allowing them to in a sense opt-out.
It seems like other than that, it seems like it is a pretty good way of recommending users that someone would be interested in. The only other things that I can think of that might also help find people with similar interests, is if you allow users to tag posts. Allowing you to find users by similar interests, or by what they are posting about.
One other more problematic thing that you could look into is finding users by similar interest. for example, if person a is following person c, and person b is following person c, then maybe recommend person a to person b. though this seems like it could make for some very lengthy queries if you are not careful.

How is this comparison/ranking algorithm called?

I've seen some sites where they show two random items from a list, and users pick which one they prefer, and then based on the results of the user preferences, a ranking is generated for the entire data set. Does anyone know what this ranking algorithm is called and how it works?
Thank you.
I believe you're referring to the ELO rating system.
A simple implementation would be to always choose two random items for the comparison and give the preferred item a point. Then rank in order of decreasing points.
The usual method for this is collaborative filtering. For this usually the choices of all persons are compared and a similarity between persons is used to weight their choices, when recommending or rating items. That means, people who have shown similar choices to yours before, are used more to generate recommendations than those who have shown different behavior.
There are several methods for doing this inference, and which one is best or how to optimize the performance is an open research questions. Most often the simplest implementation will achieve sufficent predictions and is easy to implement. It just does a two multiplications of the preference matrix with itself transposed.

Intelligent web features, algorithms (people you may follow, similar to you ...)

I have 3 main questions about the algorithms in intelligent web (web 2.0)
Here the book I'm reading http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665 and I want to learn the algorithms in deeper
1. People You may follow (Twitter)
How can one determine the nearest result to my requests ? Data mining? which algorithms?
2. How you’re connected feature (Linkedin)
Simply algorithm works like that. It draws the path between two nodes let say between Me and the other person is C. Me -> A, B -> A connections -> C . It is not any brute force algorithms or any other like graph algorithms :)
3. Similar to you (Twitter, Facebook)
This algorithms is similar to 1. Does it simply work the max(count) friend in common (facebook) or the max(count) follower in Twitter? or any other algorithms they implement? I think the second part is true because running the loop
dict{count, person}
for person in contacts:
dict.add(count(common(person)))
return dict(max)
is a silly act in every refreshing page.
4. Did you mean (Google)
I know that they may implement it with phonetic algorithm http://en.wikipedia.org/wiki/Phonetic_algorithm simply soundex http://en.wikipedia.org/wiki/Soundex and here is the Google VP of Engineering and CIO Douglas Merrill speak http://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s
What about first 3 questions? Any ideas are welcome !
Thanks
People who you may follow
You can use the factors based calculations:
factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)
result = (factorA+factorB+factorC) / 3 // double(0.5666666666666667)
// if result is more than 0.5, you show this person
So say in the case of Twitter, "People who you may follow" can based on the following factors (User A is the user viewing this "People who you may follow" feature, there may be more or less factors):
Relativity between frequent keywords found in User A's and User B's tweets
Relativity between the profile description of both users
Relativity between the location of User A and B
Are people User A is following follows User B?
So where do they compare "People who you may follow" from? The list probably came from a combination of people with high amount of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and [people whom User A is following] is following.
Basically there's a certain level of data mining to be done here, reading the tweets and bios, calculations. This can be done on a daily or weekly cron job when the server load is least for the day (or maybe done 24/7 on a separate server).
How are you connected
This is probably a smart work here to make you feel that loads of brute force has been done to determine the path. However after some surface research, I find that this is simple:
Say you are User A; User B is your connection; and User C is a connection of User B.
In order for you to visit User C, you need to visit User B's profile first. By visiting User B's profile, the website already save the info indiciating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you that 'User A -> User B -> User C', ignoring all other possible paths.
This is the max level as at User C, User Acannot go on to look at his connections until User C is User A's connection.
Source: observing LinkedIN
Similar to you
It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.
Did you mean
Well you got it right there, except that Google probably used more than just soundex. There's language translation, word replacement, and many other algorithms used for the case of Google. I can't comment much on this because it will probably get very complex and I am not an expert to handle languages.
If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at http://en.wikipedia.org/wiki/Google_platform.
Conclusion
The key to highly intensified algorithms is caching. Once you cache the result, you don't have to load it every page. Google does it, Stack Overflow does it (on most of the pages with list of questions) and Twitter not surprisingly too!
Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.
People you may follow
Could be one of many types of recommendation algorithms, maybe collaborative filtering?
How you are connected
This is just a shortest path algorithm on the social graph. Assuming there is no weight to the connections, it will simply use breadth-first.
Similar to you
Simply a re-arrangement of the data set using the same algorithm as People you may follow.
Check out the book Programming Collective Intelligence for a good introduction to the type of algorithms that are used for People you may follow and Similar to you, it has great python code available too.
People You may follow
From Twitter blog - "suggestions are based on several factors, including people you follow and the people they follow" http://blog.twitter.com/2010/07/discovering-who-to-follow.html
So if you follow A and B and they both follow C, then Twitter will suggest C to you...
How you’re connected feature
I think you have answered this one.
Similar to you
As above and as you say, although the results are probably cached - so its only done once per session or maybe even less frequently...
Hope that helps,
Chris
I don't use twitter; but with that in mind:
1). On the surface, this isn't that difficult: For each person I follow, see who they follow. Then for each of the people they follow, see who they follow, etc. The deeper you go, of course, the more number crunching it takes.
You can take this a bit further, if you can also efficiently extract the reverse: For those I follow, who also follows them?
For both ways, what's unsaid is a way to weight the tweeters to see if they're someone I'd really want to follow: A liberal follower may also follow a conservative tweeter, but that doesn't mean I'd want follow the conservative (see #3).
2). Not sure, thinking about it...
3). Assuming the bio and tweets are the only thing to go on, the hard parts are:
Deciding what attributes should exist (political affiliation, topic types, etc.)
Cleaning each 140 characters to data-mine.
Once you have the right set of attributes, then two different algorithms come to mind:
K means clustering, to decide which attributes I tend to discriminate on.
N-Nearest neighbor, to find the N most similar tweeters to you given the attributes I tend to give weight to.
EDIT: Actually, a decision tree is probably a FAR better way to do all of this...
This is all speculative, but it sounds fun if one were getting paid to do this.

Algorithm for a planning tool

I'm writing a small software application that needs to serve as a simple planning tool for a local school. The 'problem' it needs to solve is fairly basic. Namely, the teachers need to talk with the parents of all children. However, some children have, of course, brothers and sisters in different groups, so these talks need to be scheduled next to eachother, to avoid the situations were parents have a talk at 6 pm and another one at 10 pm. Thus in short, given a collection of n children, where some children have 1 or more brothers or sisters, generate a schedule where all the talks of these children are planned next to each other.
Now, maybe the problem can be solved extremely easy, but on the other I have a feeling this can be a pretty complicated problem, that needs and can be solved with some sort of algorithm. Elegantly. But am I right? Is there? I've looked at the Hungarian alorithm but it doesn't quite apply to this particular problem.
Edit: I forgot to mention, that all talks take the same amount of time.
Thanks!
I think it is quite easy.
First group the kids which belong together because they share parents. Schedule the children inside a group consecutively, schedule the rest as random.
Another way to abstract it and make the problem easier is to look from the parent perspective, see brothers and sister as "child" and give them more time. Then you can just schedule the parents at random, but some need more time (because they have multiple childeren).
One approach woul dbe to define the problem in a declarative constraint language and then let it solve the problem for you. The last time I did this, I used ECLiPSe, which is a nifty little language where you define your problem space by constraints, and then let it find allowable values that satisfy those constraints.
For example, I believe you have two classes of constraints:
A teacher may only have one
conference at a time
All students in the same family must
have consecutive slots
Once you define these in ECLiPSe, it will calculate values for each student that satisfy the requirements. If you go this way, you can also easily add constraints as you need to. For example, it's easy to say that a teach is unavailable for slot Y, or teachers must take turns doing administrative work, etc.
This sorts feels like a "backpack algorithm" type of problem. You need to group the family members together then fill slots appropriately.
If you google "backpack algorithm", you'll see enough write-ups to make your head spin and also some good coded solutions.
I think if each talk could be reduced to "activities" where each activity has a start time and an end time you could use the activity-selection algorithm studied in computer science. It is based on a greedy approach and could be solved in O(n) (where n is the number of activities). You could find more information here. I am sure you will need to have to do a pre-processing here to be able to reduce the brother/sister issue as activities of the same type.

Resources