I'm trying to populate a database with sample data, and I'm hoping there's an algorithm out there that can speed up this process.
I have a database of sample people and I need to create a sample network of friend pairings. For example, person 1 might be friends with persons 2, 3, 4, and 7, and person 2 would obviously be friends with person 1, but not necessarily with any of the others.
I'm hoping to find a way to automate the process of creating these randomly generated lists of friends within certain parameters, like a minimum and maximum number of friends.
Does something like this exist or could someone point me in the right direction?
So I'm not sure if this is the ideal solution, but it worked for me. Generally the steps were:
Start with an array of people.
Copy the array and shuffle it.
Give each person in the first array a random number (within a range) of random friends (second array).
Remove the person from their own list of friends.
Iterate through each person in each friend list and check whether the owner of the list appears in that friend's list; if not, add them.
I used a pool of 1000 people, with an initial range of 3-10 friends, and after adding the reciprocals the final number of friends ranged from about 5 to 27, which was good enough for me.
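For reference, a minimal Python sketch of those steps (the pool size and friend range are the numbers mentioned above; nothing here is tied to any particular database):

import random

people = list(range(1000))                  # the pool of people
friends = {p: set() for p in people}

pool = people[:]                            # copy the array and shuffle it
random.shuffle(pool)

# Give each person a random number (3-10) of random friends from the copy.
for person in people:
    friends[person].update(random.sample(pool, random.randint(3, 10)))
    friends[person].discard(person)         # remove the person from their own list

# Add the reciprocals: if A lists B, make sure B lists A.
for person in list(friends):
    for friend in list(friends[person]):
        friends[friend].add(person)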
I am a beginner IT student doing a project for my programming logic and design class. I need to create pseudocode for a dice game that allows you 2 rolls with 5 dice. On the first roll you get to pick 1 die to keep. The computer then rolls the other 4 dice and calculates your score based on what you rolled. There are 3 rolls per game and the total score is displayed. Rolling nothing takes points away. The scoring is: 2 of a kind = 50 points, 3 of a kind = 75 points, 4 of a kind = 100 points, and nothing subtracts 50 points.
The whole problem I have is I don't even know where to start. I think I need this to repeat 3 times, but what variables do I set? Please someone help me. I can't really ask my instructor because he is outside smoking the whole class, and everything I have learned in this class mostly came from the internet and reading the book. I don't want to fail this class... someone please help me through this?
First of all don't panic. What you are about to do is break the task down into small steps.
Pseudo-code is not really code - you can't use it directly as a language; instead it is just plain English describing what you are doing and the flow of events.
So what are the initial steps to get you started?
Ask yourself what are the facts, what do you know exist in advance. These are the "declarations" that you make.
You have five dice. Each is a separate object, so each gets its own variable declaration
dice_1
dice_2
dice_3
dice_4
dice_5
Next decide if each die has an initial value
dice_1 initial value = 0
etc...
Next you know that you have to throw the dice a number of times. The number of turns is a variable with an initial value, and so is a counter for the turns taken so far
turns initial value = 2
turns_counter initial value = 0
You should be getting the idea now. Are there other things you should declare in advance? I think so!
Next you have to decide what it is you are doing step by step. Is it just a sequence of events or is it repeating? If it's repeating how do you get it to stop?
While turns_counter is less than turns
Repeat the following:
turns_counter = turns_counter + 1
if turns_counter = turns
Throw. Collect_result. Sum_result.
else
Throw. Collect_result. Sum_result. Remove_a_dice.
endif.
Perhaps you have to tell the reusable code which objects it will be working with? These are parameters that you pass to the reusable code, e.g. Throw(dice_1). Perhaps you also need to update some variables that you created? Do that in the reusable code, with or without passing them as parameters.
This is by no means complete or perfect, but you should get the idea about what's going on and how to break it down. It could take quite a while to do.
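A minimal runnable sketch of that flow in Python - the helper name throw and the choice to keep the first die aside are placeholders, not part of the pseudo-code above:

import random

def throw(n):
    """Roll n six-sided dice and return the results."""
    return [random.randint(1, 6) for _ in range(n)]

turns = 2
turns_counter = 0
kept = []                                   # dice set aside between turns
dice_left = 5

while turns_counter < turns:
    turns_counter = turns_counter + 1
    roll = throw(dice_left)                 # Throw / Collect_result
    print("Turn", turns_counter, "roll:", roll)
    if turns_counter == turns:
        print("Total:", sum(roll) + sum(kept))   # Sum_result on the last turn
    else:
        kept.append(roll[0])                # Remove_a_dice: set one die aside
        dice_left = dice_left - 1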
Most languages provide a pseudo-random number generator function that returns a random number within a certain range. I would start by figuring out which language you'll use and which function it provides.
Once you have that, you will need to call it for each roll of each die. If you are rolling 5 dice, you would call it 5 times. And you would call it 5 more times for a second roll.
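In Python, for instance, that function is random.randint; one roll of five dice might look like this (just an illustration):

import random

roll = [random.randint(1, 6) for _ in range(5)]   # five six-sided dice
print(roll)                                       # e.g. [3, 6, 1, 4, 4]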
That's a start anyway.
You have already almost answered the question by simply writing it down here. There is no strict definition of what pseudocode is. Why don't you start by re-writing what you've described here as a sequence of steps. Then, for each step simply refine that step further until you think you've made it as fine-grain as you like.
You could start with something like this:
Roll 5 dice.
Pick 1 die to keep.
Roll the other 4 dice.
Calculate the score.
// etc...
Quite weird to think that it's easier to ask SO than your instructor! :)
The easiest way to get started on this is to not rigorously bind yourself to the constraint of a specific language, or even to pseudocode. Simply, in natural English, write out how you would do this. Imagine that YOU are the computer, and somebody wants to play the game with you. Just imagine, in very specific detail, what you would do at each potential step, i.e.
Give the user 5 dice
Ask the user to roll them
From that roll, allow the user to pick one die to keep
...etc. Once you have done this, and you are sure it is correct, start transforming it into pseudo code by thinking about what a computer would need to do to solve this problem. For instance, you'll need a variable keeping track of how many points the user has, as well as how many total rolls have occurred. If you were very specific in your English description of the problem, this should mean you basically only need to plug pseudo code into a few sentences you already have - in other words, you're just substituting one type of pseudo code for another.
I'd like to help, but straight-up providing the pseudo code wouldn't be very helpful to you. One of the hardest steps in beginning programming is learning to break a problem down into its constituent elements. That type of granular thinking is unintuitive at first, but gets easier the more time you spend on it.
Well, pseudo-code, in my experience, is best drawn up when you pretend you're writing up the work for someone else to do:
THINGS WE NEED
Dice
Players
Score
THINGS WE TRACK
Dice rolls
Player score
THINGS WE KNOW
(These are also called constants)
Nothing (-50)
2 of a kind (+50)
3 of a kind (+75)
4 of a kind (+100)
All of these are vital tools to getting started. And...well, asking questions on stackoverflow.
Next, define your "actions" (things we do), which utilizes the above known things that we will need.
I would start the same place I always do: creating our things.
def player():
"""Create a new player"""
def dice():
"""Creates 4 new, 6 sided dice"""
def welcome():
"""Welcome player by name, give option to quit"""
def game():
"""Initialize number of turns (start at 0)"""
def humanturn():
"""Roll dice, display, ask which one they'll keep"""
def compturn():
"""Roll four dice"""
def check():
"""Check for any matches in the dice"""
def score():
"""Tally up the score for any matches"""
def endturn():
"""Update turn(s), update total score"""
def gameover():
"""Display name, total score, ask for retry"""
def quit():
"""Quit the game"""
Those are your components, all fleshed out in a very procedural manner. There are many other ways to do this that are much better, but for now you're just writing the skeleton of an idea. You may be tempted to combine many of these methods together when you're ready to start coding, but it's a good idea to separate everything until you're confident you won't get lost chasing down a bug.
Good luck!
I'm studying recommendation engines, and I went through the paper that defines how Google News generates recommendations to users for news items which might be of interest to them, based on collaborative filtering.
One interesting technique that they mention is Minhashing. I went through what it does, but I'm pretty sure that what I have is a fuzzy idea and there is a strong chance that I'm wrong. The following is what I could make out of it:
Collect a set of all news items.
Define a hash function for a user. This hash function returns the index of the first item from the news items which this user viewed, in the list of all news items.
Collect, say "n" number of such values, and represent a user with this list of values.
Based on the similarity count between these lists, we can calculate the similarity between users as the number of common items. This reduces the number of comparisons a lot.
Based on these similarity measures, group users into different clusters.
This is just what I think it might be. In Step 2, instead of defining a single fixed hash function, it might be possible to vary the hash function so that it returns the index of a different element: one hash function could return the index of the first element from the user's list, another hash function could return the index of the second element, and so on. Given that the hash functions satisfy the min-wise independent permutations condition, this does sound like a possible approach.
Could anyone please confirm whether what I think is correct? Or does the minhashing portion of the Google News recommendation system function in some other way? I'm new to the internal implementation of recommenders. Any help is appreciated a lot.
Thanks!
I think you're close.
First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first item. Since everyone had the same permutation, two people have a decent chance of having the same first item.
Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.
People who happen to have the same hash value 2-4 times (that is, the same first element in 2-4 permutations) are put together in a cluster. This algorithm is repeated 10-20 times, so that each person gets put into 10-20 clusters. Finally, recommendations are given based (the small number of) other people in the 10-20 clusters. Since all this work is done by hashing, people are put directly into buckets for their clusters, and large numbers of comparisons aren't needed.
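Here is a small Python sketch of that scheme - the item universe, the click histories, and the choice of 2 hashes per cluster key are invented for illustration:

import random
from collections import defaultdict

items = ["item%d" % i for i in range(100)]       # all news items
users = {                                        # invented click histories
    "alice": {"item1", "item5", "item7"},
    "bob":   {"item5", "item7", "item42"},
    "carol": {"item60", "item61"},
}

def minhash(viewed, permutation):
    """Index of the first item, in this permutation's order, that the user viewed."""
    return next(i for i, item in enumerate(permutation) if item in viewed)

random.seed(0)
clusters = defaultdict(list)
for run in range(10):                            # 10-20 independent clusterings
    perms = [random.sample(items, len(items)) for _ in range(2)]   # 2-4 permutations
    for user, viewed in users.items():
        key = (run,) + tuple(minhash(viewed, p) for p in perms)
        clusters[key].append(user)

for key, members in clusters.items():
    if len(members) > 1:
        print(members)                           # users who share a cluster

Users with overlapping histories (alice and bob above) land in the same bucket in some runs, without any pairwise comparisons.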
We were set an algorithm problem in class today, as a "if you figure out a solution you don't have to do this subject". So of course, we all thought we'd give it a go.
Basically, we were provided a DB of 100 words and 10 categories. There is no given mapping between the words and the categories. So it's basically a list of 100 words, and 10 categories.
We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorithmically.
E.g. one of the words is "fishing" and one of the categories is "sport" --> so this word would go into that category. There is some overlap between words and categories such that some words could go into more than one category.
If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.
Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?
Even a keyword DB or something might be helpful ? Anyone know of any free ones?
First of all you need sample text to analyze, to get the relationship of words.
A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
A different approach would be naive Bayes text categorization. Sample texts with assigned categories are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category; see Bayesian spam filtering. I don't know how well that works with single words.
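As a rough illustration of the naive Bayes idea in Python - the two training texts are invented, and a real system would need a large labelled corpus:

from collections import defaultdict

training = [                                     # invented labelled texts
    ("sport",   "fishing football running fishing goal match"),
    ("weather", "rain tornado snow wind storm tornado"),
]

word_counts = defaultdict(lambda: defaultdict(int))
category_totals = defaultdict(int)
for category, text in training:
    for word in text.split():
        word_counts[category][word] += 1         # how often each word occurs per category
        category_totals[category] += 1

def classify(word):
    """Pick the category with the highest smoothed likelihood P(word | category)."""
    def likelihood(cat):
        return (word_counts[cat][word] + 1.0) / (category_totals[cat] + 1.0)
    return max(word_counts, key=likelihood)

print(classify("fishing"))                       # sport
print(classify("tornado"))                       # weather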
Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static void Main() {
    string[] words = { "fish", "ball", "hat", "trousers", "snowball", "tornado" };
    string[] categories = { "sport", "fashion", "weather" };
    using (WebClient client = new WebClient()) {
        foreach (string word in words) {
            // Pick the category whose "+word +category" query ranks highest.
            var bestCategory = categories.OrderByDescending(
                cat => Rank(client, word, cat)).First();
            Console.WriteLine("{0}: {1}", bestCategory, word);
        }
    }
}

static int Rank(WebClient client, string word, string category) {
    string s = client.DownloadString("http://www.google.com/search?q=%2B" +
        Uri.EscapeDataString(word) + "+%2B" +
        Uri.EscapeDataString(category));
    // Scrape the result count ("of about <b>N</b>") out of the page.
    var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");
    int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
    Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
    return rank;
}
Maybe you are all making this too hard.
Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.
Or not...depends on the teacher.
So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.
In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.
Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:
Start by more or less randomly assigning name-value pairs.
Output the results.
Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
Input the results of the AMT evaluation back into the system along with the random assignments.
If everything was approved, then we are done.
Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
Return to step 3.
Granted this would entail some financial outlay, but it might also be one of the simplest and most accurate versions of the data you are going to get on a fairly easy basis.
You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
EDIT:
Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.
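A sketch of that keyword-density scoring in Python, assuming the article text for each category has already been pulled into a dict (the snippets here are invented stand-ins for real article dumps):

category_articles = {
    "sport":   "anglers enjoy fishing and fishing competitions every weekend",
    "weather": "the storm forced fishing boats back to the harbour early",
}

def best_category(word):
    """Pick the category whose articles have the highest density of the word."""
    def density(text):
        tokens = text.lower().split()
        return float(tokens.count(word.lower())) / len(tokens)
    return max(category_articles, key=lambda cat: density(category_articles[cat]))

print(best_category("fishing"))                  # sport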
This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.
Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.
I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.
The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?
I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.
Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?
Fire this teacher.
The only solution to this problem is to already have the solution to the problem. Ie. you need a table of keywords and categories to build your code that puts keywords into categories.
Unless, as you suggest, you add a system which "understands" english. This is the person sitting in front of the computer, or an expert system.
If you're building an expert system and don't even know it, the teacher is not good at setting problems.
Google is forbidden, but they have an almost perfect solution - Google Sets.
Because you need to understand the semantics of the words, you need external data sources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or linked pages.
Yeah I'd go for the wordnet approach.
Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem.
Hope this helps,
X.
Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)
Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,
**fishing**, sportfishing (the act of someone who fishes as a diversion)
  => direct hypernym: **outdoor sport, field sport** (a sport that is played outdoors)
    => direct hypernym: **sport**, athletics (an active diversion requiring physical exertion and competition)
What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.
Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" under "sport" - say, based on the linear distance of the word chain, or the number of occurrences, et al. (It should be trivial to find resources on how to construct similarity measures using WordNet. When the prof says "not to use google", I assume he means programmatically, not as a means to get information to read up on!)
As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/
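To make the WordNet idea concrete - here in Python with NLTK rather than C#, purely as a sketch (path_similarity scores how close two senses sit in the hypernym graph; the word and category lists are the question's examples):

from nltk.corpus import wordnet as wn            # requires nltk.download('wordnet')

words = ["fishing", "trousers", "tornado"]
categories = ["sport", "fashion", "weather"]

def similarity(word, category):
    """Best path similarity over all sense pairs of the two terms."""
    scores = [s1.path_similarity(s2) or 0
              for s1 in wn.synsets(word)
              for s2 in wn.synsets(category)]
    return max(scores, default=0)

for word in words:
    print(word, "->", max(categories, key=lambda cat: similarity(word, cat)))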
My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)
That might be considered cheating, though.
Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong...
I would do a few passes. First query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with; 5 is a good start I think) and scan the text. In particular I'll search for the 10 categories. If a category appears more than x times (x again being some threshold you need to experiment with) it's a match.
Based on that x threshold (i.e. how many times a category appears in the text) and how many of the top y pages it appears in, you can assign a weight to a word-category pair.
For better accuracy you can then do another pass with those non-google search engines with the word-category pair (with an AND relationship) and apply the number of resulting pages to the weight of that pair. Then simply assume the word-category pair with the highest weight is the right one (assuming you'll even have more than one option). You can also assign a word to multiple categories if the weights are close enough (a z threshold maybe).
Based on that you can introduce any number of words and any number of categories. And you'll win your challenge.
I also think this method is good for evaluating the weight of potential adwords in advertising. But that's another topic...
Good luck
Harel
Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.
Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse than to start by reading existing research and benchmarks.
Apart from Google there exist other 'encyclopedic' datasets you can build off, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English-language Wikipedia.
Be creative. There is other data out there besides Google.
My attempt would be to use the toolset of CRM114 to analyze a big corpus of text. Then you can use the matches from it to make a guess.
My naive approach:
Create a huge text file like this (read the article for inspiration)
For every word, scan the text and, whenever you match that word, count the 'categories' that appear within N positions (a maximum distance, i.e. a radius) to the left and right of it.
The word is likely to belong in the category with the greatest counter.
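That window count might look like this in Python (the one-line corpus is invented; a real run would scan a large text dump):

text = "he went fishing with his friends a great sport fishing needs patience"
categories = ["sport", "fashion", "weather"]
N = 5                                            # radius: positions to look left and right

tokens = text.lower().split()

def best_category(word):
    """Count category mentions within N tokens of each occurrence of the word."""
    counts = dict.fromkeys(categories, 0)
    for i, token in enumerate(tokens):
        if token == word:
            window = tokens[max(0, i - N):i + N + 1]
            for cat in categories:
                counts[cat] += window.count(cat)
    return max(counts, key=counts.get)

print(best_category("fishing"))                  # sport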
Scrape delicious.com and search for each word, looking at collective tag counts, etc.
Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.
The knowledge is in the tags.
Since you don't need to attend the subject if you solve this 'riddle', it's not supposed to be easy, I think.
Nevertheless I would do something like this (told in a very simplistic way)
Build up a neural network to which you give some input (an ebook, or several ebooks)
=> no google needed
This network classifies words (neural networks are great for 'unsure' classification). I think you can simply tell which word belongs to which category from the occurrences in the text ('fishing' is likely to be mentioned near 'sport').
After some training of the neural network it should "link" you the words to the categories.
You might be able to put the WordNet database to use: create some metric to determine how closely linked two words (the word and the category) are, and then choose the best category to put the word in.
You could implement a learning algorithm to do this using a Monte Carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match." If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations since it knows it doesn't belong there. This will get very accurate results.
This will work for the 100-word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The Google method would probably work best, since Google has already done a ton of work on it, but barring that you could, for example, pull data from your Facebook account using the Facebook APIs and try to figure out which words are statistically more likely to appear with previously categorized words.
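A bare-bones version of that feedback loop in Python - the word and category lists are the question's examples, and it assumes the human eventually confirms one category per word:

import random

words = ["fishing", "trousers", "tornado"]
categories = ["sport", "fashion", "weather"]

excluded = {word: set() for word in words}       # categories ruled out per word
assigned = {}

while len(assigned) < len(words):
    for word in words:
        if word in assigned:
            continue
        guess = random.choice([c for c in categories if c not in excluded[word]])
        if input("Is %r in category %r? (y/n) " % (word, guess)).lower().startswith("y"):
            assigned[word] = guess               # match: the word is categorized
        else:
            excluded[word].add(guess)            # not a match: never try it again

print(assigned)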
Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ASCII text for the name :P