How to interpolate user interests

How to interpolate user interests - algorithm

I have a database of users with one field describing their interests in form of an array of strings. I'd like to write an algorithm which is able to take as input the list of users and returns groups of users with the same interests. The end goal would be to suggest the users what they may be interested into based on what users with similar interests are interested into. Could anybody suggest an algorithm that I could implement to achieve such result?
Thank you.

Related

Grouping web site users by routes they make

I’m writing a simple website. I want to be able to group users by routes they make on my sites. For example I have this site tree, but the final product will be more complicated. Lets say I have three users.
User one route is A->B->C->D
User two route is A->B->C->E
User three route is J->K
We can say that users one and two belongs to same group and user three belongs to some other group.
My question is: what algorithm or maybe more than one I should use to accomplish that> Also what data do I need to collect?
I have some ideas, however, I want to confront it with someone who might have more experience than me.
I’m looking for suggestions rather than an exact solution for my problem. Also, if there are any ready-made solution which I can read, I will appreciate it also.

You could also consider common subpaths. A pointer in this direction and general web usage analysis is: http://pdf.aminer.org/000/473/298/improving_the_effectiveness_of_a_web_site_with_web_usage.pdf

As a first cut, it seems reasonable to divide the problem into (1) defining a similarity score for two traces and (2) using the similarity scores to cluster. One possibility for (1) is Levenshtein distance. Two possibilities for (2) are farthest-point clustering and k-modes.

"Who to follow" algorithm

I want to give users the ability to view some personalized users they might find interesting and might follow them...
I was thinking of it like that:
- Get all users he is currently following
- Get all followers that they follow
- rank them by total posts they made (DESC), filled up personal information fields
- show 5 of them on each page load
in case user has followers then an information message will appear...
Can this kind of feature be done with this algorithm or is there a better or even easier way to do it?

In your algorithm, I'm wondering why you need to sort users based on number of posts, maybe it has something to do with reputation?
Recommendation is indeed a very large, open topic, and is also a hot academic research fields. If we are working on a practical project, I think it will be nice to to stay simple and focused.
I witnessed the following two kinds of recommendations on a very popular
social website. From my experience, the recommendation output is of high quality. Here I'm brainstorming the algorithms behind. Hope it helps.
Discover persons you might know: Recommend person whose 'following set' intersects with your 'following set'. It is based on the "clustering effect" of social network: The friend of your friend is more likely to be your friend.
Recommend person based on interests: If the users could be celebrities, companies, institutions, press media, etc., then recommendations like the following might be useful: "People following #Linus also follow #Stallman, #LinuxDeveloper, ...". Suppose you've just followed #Linus, to recommend #Stallman, #LinuxDeveloper, first we need to find out all users following #Linus, then figure out their common following list, possibly ranked by number of followers. The idea is to recommend users based on interest correlations. We calculate and discover high correlation users, assuming that users' following list are grouped by their interests.
(I'm also thinking, algorithm 1 will discover persons that share common interests with you, if users could be celebrities, etc.. This might be preferred for some scenarios.)

You're asking a very open-ended question here - how to pick a small number of recommendations out of a large set. So the answer is - you can make it as simple or as complicated as you want it to be! The simplest would be to pick a few at random (and any more complex algorithm had better prove that it produces better results than that.) Your solution of gathering all users who are two hops away, and then ranking by number of posts, is just a bit more complex, and then at the other extreme are the sophisticated algorithms used by the Amazons and Googles of the world. Companies put a lot of effort into building this sort of thing - have you heard of the Netflix Prize?

as I understand you want to follow the user that could offer high quality information about your Thema .we need an Algorithm to give this user as result to us ,but how can I find these users:
The users that have many Followers are a good choice but not always many of users in Twitter follow another users only as respect or ethiquet.
The users that his/her twitts retwitt many times with other user is a good choice
and the user that they are mentioned many times by other users.
I think ,to find theses users we should use Link based Analyse such as HITS or Page rank algorithim

You may want to consider not including people that are following the given user. I imagine might not be so interested in you, and this could potentially be problematic. However, you maybe very interested in finding more about the people that is following.
Are you considering showing the user the reason why these people were recommended to them? For example, saying like you may be interested in what little billy is saying because of his connection to your wife. If so, to potentially avoid angered users, it may be worth allowing them to in a sense opt-out.
It seems like other than that, it seems like it is a pretty good way of recommending users that someone would be interested in. The only other things that I can think of that might also help find people with similar interests, is if you allow users to tag posts. Allowing you to find users by similar interests, or by what they are posting about.
One other more problematic thing that you could look into is finding users by similar interest. for example, if person a is following person c, and person b is following person c, then maybe recommend person a to person b. though this seems like it could make for some very lengthy queries if you are not careful.

algorithm for ranking search results based on previous usage

First and foremost, no I'm not asking please tell me how Google is built in two sentences. What I am asking is slightly different. I have a database filled with textual data that users input. We also give them the functionality to search for this data later. The problem is, we do a simple full text search now and return the results in any order. I'd like to return the results based on a weight, a weight of how often the user types in something. An an example a user might type in the following:
"foo"
"bo"
"bob"
"bob"
"bob"
"bo"
"foo2"
Based on the above data, a search on 'b' should return bo and bob, but bob should be listed first. It is the most relevant based on usage.
Curious, what algorithm should I research to build this in an effective fashion? Any books based on common web algorithms (I know this isn't just web specific) out there that will explain this?

there is various search algorithms out there.
Here's a little guidepost to some of them:
http://en.wikipedia.org/wiki/Search_algorithm
not an expert myself in this area, so I cannot recommend a specific one.

I don't know how you'd do this in the context of a database, but here's one way to go about it:
Use a trie to store each unique word and the count of how often it was used. When your user starts typing, the trie allows you to efficiently grab all the string with the given prefix, which you can then sort using the words' counts as keys.

We use apache solr for our search.
In this technology, I think, this is normally done via boosting. So index your data and every day or so then boost individual documents based on user queries.

I need a problem for my algorithm. I need like search engine input data

Imagine you have some products, or items, or just anything that you want to see in some order of importance. Like you want in a search engine. Like websites. And you don't know how to sort them. But you have some criteria that give you a clue. You have a bag of criteria, and according to each, you can find a sorting, but you cannot aggregate them to one preference list.
Well, I can. It's part of my thesis and I'd like to show the practical usefulness. I would appreciate suggestions on what to sort here and which criteria to use.
I thought about things like: A DVD store sorting DVDs according to: quality of the medium, match with the query string, user votes.
So, I would enjoy to have a real-world problem including data, where users would tell me if they like my sorting. And where I can see if the obtained sorting is useful. That's kind of the point: is this better than the standard algorithms.
cheers,
niko

You could sort programming questions by date, preferred and disliked tags, number of answers, votes,... to find the most interesting ones :)

The netflix prize? See here: http://www.netflixprize.com/ Maybe it is more about clustering than sorting.

How to detect duplicate data?

I have got a simple contacts database but I'm having problems with users entering in duplicate data. I have implemented a simple data comparison but unfortunately the duplicated data that is being entered is not exactly the same. For example, names are incorrectly spelled or one person will put in 'Bill Smith' and another will put in 'William Smith' for the same person.
So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

So is there some sort of algorithm
that can give a percentage for how
similar an entry is to another?
Algorithms as Soundex and Edit distances (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, this will not be enough. As others have stated "Bill" does not sound anything like "William".
The best solution I have found is to use a reduction algorithm and table to reduce the names to it's root name.
To your regular Address table, add Root-versions of the names, e.g
Person (Firstname, RootFirstName, Surname, Rootsurname....)
Now, create a mapping table.
FirstNameMappings (Primary KEY Firstname, Rootname)
Populate your Mapping table by:
Insert IGNORE (select Firstname, "UNDEFINED" from Person) into FirstNameMappings
This will add all firstnames that you have in your person table together with the RootName of "UNDEFINED"
Now, sadly, you will have to go through all the unique first names and map them to a RootName. For example "Bill", "Billl" and "Will" should all be translated to "William"
This is very time consuming, but if data quality really is important for you I think it's one of the best ways.
Now use the newly created mapping table to update the "Rootfirstname" field in your Person table. Repeat for surname and address. Once this is done you should be able to detect duplicates without suffering from spelling errors.

You can compare the names with the Levenshtein distance. If the names are the same, the distance is 0, else it is given by the minimum number of operations needed to transform one string into the other.

I imagine that this problem is well understood but what occurs to me on first reading is:
compare fields individually
count those that match (for a possibly loose definition of match, and possibly weighing the fields differently)
present for human intervention any cases which pass some threshold
Use your existing database to get a good first guess for the threshold, and correct as you accumulate experience.
You may prefer a fairly strong bias toward false positives, at least at first.

While I do not have an algorithm for you, my first action would be to take a look at the process involved in entering a new contact. Perhaps users do not have an easy way to find the contact they are looking for. Much like on Stack Overflow's new question form, you could suggest contacts that already exist on the new contact screen.

If you have access SSIS check out the Fuzzy grouping and Fuzzy lookup transformation.
http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
http://msdn.microsoft.com/en-us/library/ms137786.aspx

If you have a large database with string fields, you can very quickly find a lot of duplicates by using the simhash algorithm.

This may or may not be related but, minor misspellings might be detected by a Soundex search, e.g., this will allow you to consider Britney Spears, Britanny Spares, and Britny Spears as duplicates.
Nickname contractions, however, are difficult to consider as duplicates and I doubt if it is wise. There are bound to be multiple people named Bill Smith and William Smith, and you would have to iterate that with Charles->Chuck, Robert->Bob, etc.
Also, if you are considering, say, Muslim users, the problems become more difficult (there are too many Muslims, for example, that are named Mohammed/Mohammad).

I'm not sure it will work well for the names vs nicknames problem, but the most common algorithm in this sort of area would be the edit distance / Levenshtein distance algorithm. It's basically a count of the number of character changes, additions and removals required to turn one item into another.
For names, I'm not sure you're ever going to get good results with a purely algorithmic approach - What you really need is masses of data. Take, for example, how much better Google spelling suggestions are than those in a normal desktop application. This is because Google can process billions of web queries and look at what queries lead to each other, what 'did you mean' links actually get clicked etc.
There are a few companies which specialise in the name matching problem (mostly for national security and fraud applications). The one I could remember, Search Software America seems to have been bought out by these guys http://www.informatica.com/products_services/identity_resolution/Pages/index.aspx, but I suspect any of these sorts of solutions would be far to expensive for a contacts application.

FullContact.com has API's that can solve this for you, see their documentation here: http://www.fullcontact.com/developer/docs/?category=name.
They have APIs for Name Normalization (Bill into William), Name Deducer (for raw text), and Name Similarity (comparing two names).
All APIs are free at the moment, it could be a good way to get started.

You might also want to look into probabilistic matching.

For those wandering around the web and end up here, might I suggest that you try using a Google Sheet add-on I created called Flookup.
It's particularly good with names and it has a couple of other awesome features which I'll describe below:
Say you have a list of names and there are 2 people called "John Smith". You can use the rank parameter from Flookup to instruct the algorithm to return the 1st, 2nd, 3rd or nth best match. This is helpful if you have additional information that you can use to identify the "John Smith" you want.
Say you have an additional database/list of apartment numbers. You an specify which "John Smith" you want by typing: John Smith & Apartment A or John Smith & Apartment B as the lookup parameter to help distinguish between the two names.
I hope you find Flookup as beneficial as others have.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio