Creating random overlapping groups - sample-data

I'm trying to populate a database with sample data, and I'm hoping there's an algorithm out there that can speed up this process.
I have a database of sample people and I need to create a sample network of friend pairings. For example, person 1 might be friends with persons 2, 3, 4, and 7, and person 2 would obviously be friends with person 1, but not necessarily with any of the others.
I'm hoping to find a way to automate the process of creating these randomly generated lists of friends within certain parameters, like a minimum and maximum number of friends.
Does something like this exist or could someone point me in the right direction?

So I'm not sure if this is the ideal solution, but it worked for me. Generally, the steps were:
Start with an array of people.
Copy the array and shuffle it.
Give each person in the first array a random number (within a range) of random friends (second array).
Remove the person from their own list of friends.
Iterate through each friend list; for each friend, check whether the list's owner appears in that friend's own list, and add them if not.
I used a pool of 1000 people with an initial range of 3-10 friends, and after adding the reciprocal links the final counts ranged from about 5 to 27, which was good enough for me.
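In case it helps anyone, here's roughly what those steps look like in Java. This is a sketch, not polished code; the class name is my own, and the 1000 / 3-10 figures just mirror the numbers above:

    import java.util.*;

    public class FriendNetwork {
        // Each person draws a random number of friends in [minFriends, maxFriends]
        // from a shuffled pool, then a reciprocity pass mirrors every link.
        public static List<Set<Integer>> generate(int n, int minFriends, int maxFriends, Random rng) {
            List<Integer> pool = new ArrayList<>();
            for (int i = 0; i < n; i++) pool.add(i);

            List<Set<Integer>> friends = new ArrayList<>();
            for (int person = 0; person < n; person++) {
                Collections.shuffle(pool, rng);                       // fresh random order per person
                int count = minFriends + rng.nextInt(maxFriends - minFriends + 1);
                Set<Integer> list = new HashSet<>(pool.subList(0, count + 1));
                list.remove(person);                                  // no self-friendship
                while (list.size() > count) list.remove(list.iterator().next());
                friends.add(list);
            }
            for (int a = 0; a < n; a++)                               // reciprocity pass:
                for (int b : friends.get(a))                          // if a lists b,
                    friends.get(b).add(a);                            // b must list a
            return friends;
        }

        public static void main(String[] args) {
            List<Set<Integer>> net = generate(1000, 3, 10, new Random());
            IntSummaryStatistics s = net.stream().mapToInt(Set::size).summaryStatistics();
            System.out.println("min=" + s.getMin() + " max=" + s.getMax() + " avg=" + s.getAverage());
        }
    }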

Related

Formulating an algorithm for a group sorting program with exclusion factors

I'm trying to formulate an equation/algorithm to solve this problem (for a program I'm writing):
Rules:
A person, p, that is to be sorted can exclude n people from the list. Those n excluded people cannot be in the same group as p.
The list will contain around 100-150 people.
A group should contain 5-7 people (ideally 6)
My current thoughts:
Take the list count and divide it by 6, which will give me the number of groups.
Feed people into the groups until an exclusion occurs. When that happens, try to move the mismatched people into other groups, based on some sort of scoring system, until proper groups are formed.
However, I still feel like I need to put a limit on the number of people each person is allowed to exclude.
My question is basically how I would figure out how many people a given person can exclude to make this endeavor possible. Considering there will be around 150 people, each with their own list of people to exclude, is it even possible? Some exceptions are of course allowed. Ideas and thoughts are also appreciated!
I'm planning to write the program in java.
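A rough sketch of the greedy first pass I have in mind, in Java (all the names here are placeholders); the leftover people it prints are the ones the score-based swap step would have to rehome:

    import java.util.*;

    public class GroupSorter {
        // Greedy first pass: place each person into the first group with room
        // that breaks no exclusion in either direction; whoever can't be placed
        // is left over for the score-based swap phase.
        public static List<Set<Integer>> firstPass(int n, Map<Integer, Set<Integer>> excludes, int groupSize) {
            int groupCount = (n + groupSize - 1) / groupSize;         // e.g. ceil(150/6) = 25
            List<Set<Integer>> groups = new ArrayList<>();
            for (int g = 0; g < groupCount; g++) groups.add(new HashSet<>());

            List<Integer> unplaced = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                final int p = i;
                Set<Integer> banned = excludes.getOrDefault(p, Set.of());
                Optional<Set<Integer>> home = groups.stream()
                        .filter(g -> g.size() < groupSize)
                        .filter(g -> Collections.disjoint(g, banned))   // p excludes nobody here
                        .filter(g -> g.stream().noneMatch(              // nobody here excludes p
                                m -> excludes.getOrDefault(m, Set.of()).contains(p)))
                        .findFirst();
                if (home.isPresent()) home.get().add(p); else unplaced.add(p);
            }
            System.out.println("left for the swap phase: " + unplaced);
            return groups;
        }
    }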

n! combinations, how to find best one without killing computer?

I'll get straight to it. I'm working on a web or phone app that is responsible for scheduling. I want students to input courses they have taken, and I give them possible combinations of courses they should take that fit their requirements.
However, let's say there are 150 courses that fit their requirements and they're looking for 3 courses. That would be 150C3 combinations, right?
Would it be feasible to run something like this in browser or a mobile device?
First of all, you need a smarter algorithm that can prune the search tree. Also, if you are doing this for the same set of courses over and over again, doing the computation on the server would be better, and precomputing a suitable data structure could reduce the execution time of the queries. For example, you could create a tree where each sub-tree under a node contains nodes that are 'compatible'.
Sounds to me like you're viewing this completely wrong. At most institutions there are 1) curriculum requirements for graduation, and 2) prerequisites for many requirements and electives. This isn't a pure combinatorial problem, it's a dependency tree. For instance, if Course 201, Course 301, and Course 401 are all required for the student's major, higher numbers have the lower numbered ones as prereqs, and the student is a Junior, you should be strongly recommending that Course 201 be taken ASAP.
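For illustration, here's a tiny sketch of that dependency view in Java; the prereq map and course names are made up:

    import java.util.*;

    public class PrereqCheck {
        // A course is eligible if it hasn't been taken and all its prereqs have.
        public static Set<String> eligible(Map<String, Set<String>> prereqs, Set<String> taken) {
            Set<String> out = new TreeSet<>();
            for (Map.Entry<String, Set<String>> e : prereqs.entrySet())
                if (!taken.contains(e.getKey()) && taken.containsAll(e.getValue()))
                    out.add(e.getKey());
            return out;
        }

        public static void main(String[] args) {
            Map<String, Set<String>> prereqs = Map.of(
                    "Course 201", Set.of(),
                    "Course 301", Set.of("Course 201"),
                    "Course 401", Set.of("Course 301"));
            System.out.println(eligible(prereqs, Set.of()));              // [Course 201]
            System.out.println(eligible(prereqs, Set.of("Course 201")));  // [Course 301]
        }
    }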
Yay, mathematics I think I can handle!
If there are 150 courses, and you have to choose 3, then the number of possibilities is (150*149*148)/(3*2) = 551,300 (correction per jerry), which is certainly better than 150 factorial, which has a whole lot more zeros ;)
Now, you really don't want to build an array that size, and you don't have to! All web languages have a way of randomly choosing an element from an array, so you take the array of courses and request 3 random unique entries from it.
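Sketched in Java (course names are placeholders), the whole trick is to shuffle a copy of the array and take the first 3, which are unique by construction:

    import java.util.*;

    public class RandomPick {
        // Shuffle a copy of the candidates and take the first k: unique by construction.
        public static <T> List<T> pick(List<T> candidates, int k, Random rng) {
            List<T> copy = new ArrayList<>(candidates);
            Collections.shuffle(copy, rng);     // O(n), no need to touch all 551,300 combinations
            return copy.subList(0, k);
        }

        public static void main(String[] args) {
            List<String> courses = List.of("Algebra", "Biology", "Chemistry", "Drama", "Economics");
            System.out.println(pick(courses, 3, new Random()));
        }
    }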
While the number of potential course combinations is very large, based on your post I see no reason to even attempt to calculate them. The task of randomly selecting k items from an n-sized list is delightfully trivial, even for old, slow devices!
Is there any particular reason you'd need to calculate all the potential course combinations, instead of just grab-bagging one random selection as a suggestion? If not, problem solved!
Option 1 (Time/Space costly): let the user on a mobile phone browse the list of (150*149*148) possible choices page by page, with the processing done at the server side.
Option 2 (Simple): instead of the (150*149*148)-item decision tree, provide a 150-item bag; whenever the user chooses an item from the bag, remove it from the bag.
Option 3 (Complex): expand your decision tree (possible choices) using a dependency tree (a parent course requires its child courses) and the list of courses already taken by the student, plus his track/level.
As far as I know, most educational systems use the third option, which requires having a profile for the student.

Algorithm for equal groups according to parameters

I have data for some people. Each person has grades for a few parameters.
I want to divide the people into N groups that are as equal as possible across all the parameters.
The parameters are ranked: for example, it is most important that parameter 1 be
equal across the groups, the second parameter has second priority, and the last parameter has the least priority.
For example:
there are 100 people with data like this:
people1 = ["param1"=12,"param2"=70,"param3"=6]
people2 = ["param1"=9,"param2"=79,"param3"=2]
and I want to divide the people into 3 groups (more or less the same size)
whose grades are as equal as possible.
Can someone help me or give me an idea?
Thanks in advance
This post makes me think of being a kid and playing soccer games in the yard with other kids.
Two captains were selected, and each one chose, turn by turn, one player from the pool for their team. This way the teams were balanced at the end.
You can certainly make an algorithm from this story, and it's super easy (even for kids :) and brings good results on large amounts of data.
The only thing you need is to sort the data by the players' "strength" and divide them.
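Here's a sketch of that draft in Java, assuming each player's grades have already been collapsed into a single "strength" score (e.g. a weighted sum respecting the parameter priorities above); dealing in a snake order is a small tweak on the captains' alternation:

    import java.util.*;

    public class Draft {
        // Sort players strongest-first, then deal them out in a snake order
        // (0,1,2,2,1,0,...) so no group keeps getting the best leftover pick.
        public static List<List<Integer>> divide(double[] strength, int nGroups) {
            Integer[] order = new Integer[strength.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(strength[b], strength[a]));

            List<List<Integer>> groups = new ArrayList<>();
            for (int g = 0; g < nGroups; g++) groups.add(new ArrayList<>());
            for (int i = 0; i < order.length; i++) {
                int round = i / nGroups, pos = i % nGroups;
                int g = (round % 2 == 0) ? pos : nGroups - 1 - pos;   // snake back and forth
                groups.get(g).add(order[i]);
            }
            return groups;
        }
    }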

Multi Attribute Matching of Profiles

I am trying to solve a problem for a dating site. Here is the problem:
Each user of the app will have some attributes - like the books he reads, the movies he watches, music, TV shows, etc. These are predefined top-level attribute categories. Each of these categories can have any number of values, e.g. in books: Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with reverse indexing, i.e. each of Fountain Head, Love Story, etc. is an index key to the set of users with that attribute.
When a new user joins, get the attributes of this user, find the index keys for this user, get all the users for those keys, and bucket sort (or radix sort or similar) on the basis of how many times each user appears in this merged list.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes for every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (a book, a song, etc.) mapping to the people who have it in their profile. Then, say you want to find matches for a guy g:
for each interest i in g's interests:
    for each person p in list of i:
        if p and g have mismatching sexual preferences:
            continue
        if p is already in g's match list:
            g->match_list[p].score += i->match_weight
        else:
            add p to g->match_list with score i->match_weight
sort g->match_list based on score
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of its parts individually. Also, an interest of one person may result in outright rejection by the other, no matter what other matching interests exist (take two very similar people where one loves and the other hates Twilight, for example).
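For concreteness, here's a runnable Java version of the pseudocode above, with the reverse index and weights kept as plain maps; all the names are illustrative, and the preference check is left as a stub:

    import java.util.*;
    import java.util.stream.*;

    public class Matcher {
        // Score every candidate who shares an interest with g via the reverse
        // index, weighting each shared interest, then rank by total score.
        public static List<String> match(String g,
                                         Map<String, Set<String>> interestsOf,  // user -> interests
                                         Map<String, Set<String>> usersWith,    // interest -> users (reverse index)
                                         Map<String, Double> weightOf) {        // interest -> match weight
            Map<String, Double> score = new HashMap<>();
            for (String i : interestsOf.get(g))
                for (String p : usersWith.getOrDefault(i, Set.of())) {
                    if (p.equals(g)) continue;          // preference/compatibility checks would go here
                    score.merge(p, weightOf.getOrDefault(i, 1.0), Double::sum);
                }
            return score.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }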

The algorithm used to generate recommendations in Google News?

I'm studying recommendation engines, and I went through the paper that defines how Google News generates recommendations to users for news items that might be of interest to them, based on collaborative filtering.
One interesting technique they mention is Minhashing. I went through what it does, but I'm pretty sure what I have is a fuzzy idea, and there is a strong chance that I'm wrong. The following is what I could make of it:
Collect a set of all news items.
Define a hash function for a user. This hash function returns the index, in the list of all news items, of the first item this user viewed.
Collect, say "n" number of such values, and represent a user with this list of values.
Based on the similarity count between these lists, we can calculate the similarity between users as the number of common items. This reduces the number of comparisons a lot.
Based on these similarity measures, group users into different clusters.
This is just what I think it might be. In Step 2, instead of defining a single fixed hash function, it might be possible to vary the hash function so that it returns the index of a different element: one hash function could return the index of the first element of the user's list, another could return the index of the second element, and so on. Given that the hash functions satisfy the min-wise independent permutations condition, this does sound like a possible approach.
Could anyone please confirm whether what I think is correct, or does the minhashing portion of Google News recommendations work in some other way? I'm new to internal implementations of recommendation systems. Any help is appreciated a lot.
Thanks!
I think you're close.
First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first item. Since everyone had the same permutation, two people have a decent chance of having the same first item.
Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.
People who happen to have the same hash value 2-4 times (that is, the same first element in 2-4 permutations) are put together in a cluster. This algorithm is repeated 10-20 times, so that each person gets put into 10-20 clusters. Finally, recommendations are given based on the (small number of) other people in those 10-20 clusters. Since all this work is done by hashing, people are put directly into buckets for their clusters, and large numbers of comparisons aren't needed.
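A minimal Java sketch of that idea: each "permutation" is simulated with a seeded hash, so a user's signature entry is the minimum hash over their viewed items, i.e. the item that would come first under that permutation. The item ids and seeds here are made up:

    import java.util.*;

    public class MinHashDemo {
        static int hash(String item, int seed) {
            return (item.hashCode() ^ seed) * 0x9E3779B1;   // cheap seeded mix
        }

        // One signature entry per seed: the minimum hash over the user's items,
        // i.e. which item "comes first" under that simulated permutation.
        static int[] signature(Set<String> items, int[] seeds) {
            int[] sig = new int[seeds.length];
            for (int i = 0; i < seeds.length; i++) {
                int min = Integer.MAX_VALUE;
                for (String item : items)
                    min = Math.min(min, hash(item, seeds[i]));
                sig[i] = min;
            }
            return sig;
        }

        public static void main(String[] args) {
            int[] seeds = new Random(42).ints(4).toArray();  // 4 simulated permutations
            Set<String> alice = Set.of("a", "b", "c", "d");
            Set<String> bob = Set.of("b", "c", "d", "e");
            // Users whose signatures agree in several positions are likely similar;
            // hashing whole signature chunks gives the cluster buckets directly.
            System.out.println(Arrays.toString(signature(alice, seeds)));
            System.out.println(Arrays.toString(signature(bob, seeds)));
        }
    }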
