Related
I've been tasked with finding the most efficient solution to the following problem: I need to print out the (k) most streamed movies of genre, (g), in the given year, (y), i can assume it takes o(1) to retrieve the current year. An example of this is:
Every time a movie is streamed, i'm given the name of the movie and the genre.
"What are the top 5 most streamed romance movies in the year 2014?
The returned answer might be something like
MovieName1 (romance) 3409 streams
MovieName2 (romance) 4000 streams
MovieName3 (romance) 5340 streams
MovieName4 (romance) 9000 streams
MovieName5 (romance) 10000 streams
So my thought process is to use 3 nested hashtables.
One where i use they name (key) to map to the frequency (value)
One where i use the genre(key) to map to a map(name,frequency)(value)
And the final one where i use the year(key) to map to a map(genre,
map(name, frequency) (value)
Does that make any sense...I think confused myself just by writing this..
Is it possible to just use a single hashtable that uses they year as a key and maps to a linked list of nodes where every single node contains the movie name, frequency, and genre? Would this be more efficient?
So if i wanted to update batmans frequency i can just do map.get(2008) which would give the head of the linked list
and then do
while(tmp != null){
if(tmp.name == "the dark knight"){
temp.frequency++;
}
So you thought about using a hashmap of years to hashmaps of genres to hashmaps of names to frequencies. Does it make sense? Sure. Is it a good way to solve your problem? Most likely not. It is also possible, as you say, to use a single main hashmap of years to collections of genre-name-frequency tuples (or structures, in a lot of languages - like C, C++, Java, and so on). By collections, you thought of linked lists, but you could very well use vectors or something else (linked lists are very often the worst kind of data structures). But this would not be more efficient and is not necessarily a better way to solve your problem, even though it may be more readable and maintainable.
I won't be talking about performance improvements that don't impact time complexity, since you've made it clear in the comments that it's for an exam that only cares about time complexity (which is sad, but whatever). Also, I'm assuming this is the only problem you need to solve.
Let's see how to improve your ideas. It just so happens that a mix of both your solutions, along with one improvement, gives the best possible solution in terms of time complexity, that is O(k). First, note that each movie name is associated with only one frequency, and each frequency is associated with (likely) only one movie name. And since you want to retrieve movie names based on frequencies, a hashmap has no advantage over a linear collection of name-frequency pairs, such as a vector or a linked list. Then, note that each year and each genre is (independently) associated with multiple movies (each with both a name and a frequency). So a hashmap has a place here: with a given year and a given genre, a hashmap-like structure would give you the relevent movies in expected constant time.
Combine these two results and you get a hashmap with years and genres as keys and collections of name-frequency pairs as values. The one improvement that makes it possible to retrieve the k most streamed movies for given year and genre in O(k) time complexity is a sort: if your name-frequency pairs are sorted by frequency in every collection, you can simply return the first (or last, depending on the order) k names.
One detail that you may find weird is that the hashmap uses both years and genres as keys. That's an implementation detail. You can do it by using two nested hashmaps, one using years as keys and the other using genres, or you can combine years and genres and directly use these pairs as keys. It's actually straightforward to hash year-genre pairs.
I am trying to solve a problem of a dating site. Here is the problem
Each user of app will have some attributes - like the books he reads, movies he watches, music, TV show etc. These are defined top level attribute categories. Each of these categories can have any number of values. e.g. in books : Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with reverse indexing. i.f. Each of Fountain Head, Love Story etc is index key to set of users with that attribute.
When a new user joins, get the attributes of this user, find which index keys for this user, get all the users for these keys, bucket (or radix sort or similar sort) to sort on the basis of how many times a user in this merged list.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating a complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes with every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (like a book, a song etc) to the people who have that in their profile. Then, say you want to find matches of guy g:
for each interest i in g's interests:
for each person p in list of i
if p and g have mismatching sexual preferences
continue
if p is already in g's match list
g->match_list[p].score += i->match_weight
else
add p to g->match_list with score i->match_weight
sort g->match_list based on score
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, an interest in one may totally result in a rejection from the other no matter what other matching interest exists (Take two very similar people that one of them loves and the other hates twilight for example)
I'm study recommendation engines, and I went through the paper that defines how Google News generates recommendations to users for news items which might be of their interest, based on collaborative filtering.
One interesting technique that they mention is Minhashing. I went through what it does, but I'm pretty sure that what I have is a fuzzy idea and there is a strong chance that I'm wrong. The following is what I could make out of it :-
Collect a set of all news items.
Define a hash function for a user. This hash function returns the index of the first item from the news items which this user viewed, in the list of all news items.
Collect, say "n" number of such values, and represent a user with this list of values.
Based on the similarity count between these lists, we can calculate the similarity between users as the number of common items. This reduces the number of comparisons a lot.
Based on these similarity measures, group users into different clusters.
This is just what I think it might be. In Step 2, instead of defining a constant hash function, it might be possible that we vary the hash function in a way that it returns the index of a different element. So one hash function could return the index of the first element from the user's list, another hash function could return the index of the second element from the user's list, and so on. So the nature of the hash function satisfying the minwise independent permutations condition, this does sound like a possible approach.
Could anyone please confirm if what I think is correct? Or the minhashing portion of Google News Recommendations, functions in some other way? I'm new to internal implementations of recommendations. Any help is appreciated a lot.
Thanks!
I think you're close.
First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first item. Since everyone had the same permutation, two people have a decent chance of having the same first item.
Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.
People who happen to have the same hash value 2-4 times (that is, the same first element in 2-4 permutations) are put together in a cluster. This algorithm is repeated 10-20 times, so that each person gets put into 10-20 clusters. Finally, recommendations are given based (the small number of) other people in the 10-20 clusters. Since all this work is done by hashing, people are put directly into buckets for their clusters, and large numbers of comparisons aren't needed.
We were set an algorithm problem in class today, as a "if you figure out a solution you don't have to do this subject". SO of course, we all thought we will give it a go.
Basically, we were provided a DB of 100 words and 10 categories. There is no match between either the words or the categories. So its basically a list of 100 words, and 10 categories.
We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorthmically.
i.e. one of the words is "fishing" the category "sport" --> so this would go into this category. There is some overlap between words and categories such that some words could go into more than one category.
If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.
Does anyone have ANY idea how to start something like this? Or any resources ? Preferably in C#?
Even a keyword DB or something might be helpful ? Anyone know of any free ones?
First of all you need sample text to analyze, to get the relationship of words.
A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
A different approach would be naive bayes text categorization. Sample text with the assigned category are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category, see bayes spam filtering. I don't know how well that works with single words.
Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
static void Main() {
string[] words = { "fish", "ball", "hat", "trousers", "snowball","tornado" };
string[] categories = { "sport", "fashion", "weather" };
using(WebClient client = new WebClient()){
foreach(string word in words) {
var bestCategory = categories.OrderByDescending(
cat => Rank(client, word, cat)).First();
Console.WriteLine("{0}: {1}", bestCategory, word);
}
}
}
static int Rank(WebClient client, string word, string category) {
string s = client.DownloadString("http://www.google.com/search?q=%2B" +
Uri.EscapeDataString(word) + "+%2B" +
Uri.EscapeDataString(category));
var match = Regex.Match(s, #"of about \<b\>([0-9,]+)\</b\>");
int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
return rank;
}
Maybe you are all making this too hard.
Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.
Or not...depends on the teacher.
So it seems you have a couple options here, but for the most part I think if you want accurate data you are going to need to use some outside help. Two options that I can think of would be to make use of a dictionary search, or crowd sourcing.
In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.
Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:
Start by more or less randomly assigning name-value pairs.
Output the results.
Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
Input the results of the AMT evaluation back into the system along with the random assignments.
If everything was approved, then we are done.
Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
Return to step 3.
Granted this would entail some financial outlay, but it might also be one of the simplest and accurate versions of the data you are going get on a fairly easy basis.
You could do a custom algorithm to work specifically on that data, for instance words ending in 'ing' are verbs (present participle) and could be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
EDIT:
Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.
This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.
Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.
I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.
The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?
I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.
Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?
Fire this teacher.
The only solution to this problem is to already have the solution to the problem. Ie. you need a table of keywords and categories to build your code that puts keywords into categories.
Unless, as you suggest, you add a system which "understands" english. This is the person sitting in front of the computer, or an expert system.
If you're building an expert system and doesn't even know it, the teacher is not good at giving problems.
Google is forbidden, but they have almost a perfect solution - Google Sets.
Because you need to unterstand the semantics of the words you need external datasources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or linked pages.
Yeah I'd go for the wordnet approach.
Check this tutorial on WordNet-based semantic similarity measurement. You can query Wordnet online at princeton.edu (google it) so it should be relatively easy to code a solution for your problem.
Hope this helps,
X.
Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization based on such - I'm not sure if that is your intent (if it is, then do so by all means! :)
Since you say you can use external data, I would suggest using wordnet and its link between words. For instance, using wordnet,
# S: (n) **fishing**, sportfishing (the act of someone who fishes as a diversion)
* direct hypernym / inherited hypernym / sister term
o S: (n) **outdoor sport, field sport** (a sport that is played outdoors)
+ direct hypernym / inherited hypernym / sister term
# S: (n) **sport**, athletics
(an active diversion requiring physical exertion and competition)
What we see here is a list of relationships between words. The term fishing relates to outdoor sport, which relates to sport.
Now, if you get the drift - it is possible to use this relationship to compute a probability of classifying "fishing" to "sport" - say, based on the linear distance of the word-chain, or number of occurrences, et al. (should be trivial to find resources on how to construct similarity measures using wordnet. when the prof says "not to use google", I assume he means programatically and not as a means to get information to read up on!)
As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/
My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)
That might be considered cheating, though.
Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong...
I would do a few passes. First query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with. 5 is a good start I think) and scan the text. In particular I"ll search for the 10 categories. If a category appears more than x time (x again being some threshold you need to experiment with) its a match.
Based on that x threshold (ie how many times a category appears in the text) and how may of the top y pages it appears in you can assign a weigh to a word-category pair.
for better accuracy you can then do another pass with those non-google search engines with the word-category pair (with a AND relationship) and apply the number of resulting pages to the weight of that pair. Them simply assume the word-category pair with highest weight is the right one (assuming you'll even have more than one option). You can also multi assign a word to a multiple category if the weights are close enough (z threshold maybe).
Based on that you can introduce any number of words and any number of categories. And You'll win your challenge.
I also think this method is good to evaluate the weight of potential adwords in advertising. but that's another topic....
Good luck
Harel
Use (either online, or download) WordNet, and find the number of relationships you have to follow between words and each category.
Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse then to start reading existing research and benchmarks.
Appart from Google there exist other 'encyclopedic" datasets you can build of, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English language Wikipedia.
Be creative. There is other data out there besides Google.
My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.
My naive approach:
Create a huge text file like this (read the article for inspiration)
For every word, scan the text and whenever you match that word, count the 'categories' that appear in N (maximum, aka radio) positions left and right of it.
The word is likely to belong in the category with the greatest counter.
Scrape delicious.com and search for each word, looking at collective tag counts, etc.
Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.
The knowledge is in the tags.
As you don't need to attend the subject when you solve this 'riddle' it's not supposed to be easy I think.
Nevertheless I would do something like this (told in a very simplistic way)
Build up a Neuronal Network which you give some input (a (e)book, some (e)books)
=> no google needed
this network classifies words (Neural networks are great for 'unsure' classification). I think you may simply know which word belongs to which category because of the occurences in the text. ('fishing' is likely to be mentioned near 'sports').
After some training of the neural network it should "link" you the words to the categories.
You might be able to put use the WordNet database, create some metric to determine how closely linked two words (the word and the category) are and then choose the best category to put the word in.
You could implement a learning algorithm to do this using a monte carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match." If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations since it knows it doesn't belong there. This will get very accurate results.
This will work for the 100 word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The google method would probably work the best, since google's already done a ton of work on it, but barring that you could, for example, pull data from your facebook account using the facebook apis and try to figure out which words are statistically more likely to appear with previously categorized words.
Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ascii text for the name :P
I have a list of requirements for a software project, assembled from the remains of its predecessor. Each requirement should map to one or more categories. Each of the categories consists of a group of keywords. What I'm trying to do is find an algorithm that would give me a score ranking which of the categories each requirement is likely to fall into. The results would be use as a starting point to further categorize the requirements.
As an example, suppose I have the requirement:
The system shall apply deposits to a customer's specified account.
And categories/keywords:
Customer Transactions: deposits, deposit, customer, account, accounts
Balance Accounts: account, accounts, debits, credits
Other Category: foo, bar
I would want the algorithm to score the requirement highest in category 1, lower in category 2, and not at all in category 3. The scoring mechanism is mostly irrelevant to me, but needs to convey how much more likely category 1 applies than category 2.
I'm new to NLP, so I'm kind of at a loss. I've been reading Natural Language Processing in Python and was hoping to apply some of the concepts, but haven't seen anything that quite fits. I don't think a simple frequency distribution would work, since the text I'm processing is so small (a single sentence.)
You might want to look the category of "similarity measures" or "distance measures" (which is different, in data mining lingo, than "classification".)
Basically, a similarity measure is a way in math you can:
Take two sets of data (in your case, words)
Do some computation/equation/algorithm
The result being that you have some number which tells you how "similar" that data is.
With similarity measures, this number is a number between 0 and 1, where "0" means "nothing matches at all" and "1" means "identical"
So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.
And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)
What both of these metrics do is they take both vectors (your input sentence, and your "keyword" list) and give you a number. If you do this across all of your categories, you can rank those numbers in order to see which match has the greatest similarity coefficient.
As an example:
From your question:
Customer Transactions: deposits,
deposit, customer, account, accounts
So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that, for the "customer transactions" keyword, you have 5 words, and (this will sound obvious but) each of those words is present in your search string. keep with me.
So now you take your sentence:
The system shall apply deposits to a
customer's specified account.
This has 2 words from the "Customer Transactions" set: {deposits, account, customer}
(actually, this illustrates another nuance: you actually have "customer's". Is this equivalent to "customer"?)
The vector for your sentence might be (1, 0, 1, 1, 0)
The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.
So we could say: how many times do these vectors differ? Lets compare:
(1,1,1,1,1)
(1,0,1,1,0)
Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th position. They only differ by 2 bits. So lets say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)
Anyway, this might not be the best way to do natural language processing, but for your purposes it is the simplest and might actually work pretty well for your application, or at least as a starting point.
(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)
good luck!
The main characteristics of the problem are:
Externally defined categorization criteria (keyword list)
Items to be classified (lines from the requirement document) are made of a relatively small number of attributes values, for effectively a single dimension: "keyword".
As defined, no feedback/calibrarion (although it may be appropriate to suggest some of that)
These characteristics bring both good and bad news: the implementation should be relatively straight forward, but a consistent level of accuracy of the categorization process may be hard to achieve. Also the small amounts of various quantities (number of possible categories, max/average number of words in a item etc.) should give us room to select solutions that may be CPU and/or Space intentsive, if need be.
Yet, even with this license got "go fancy", I suggest to start with (and stay close to) to a simple algorithm and to expend on this basis with a few additions and considerations, while remaining vigilant of the ever present danger called overfitting.
Basic algorithm (Conceptual, i.e. no focus on performance trick at this time)
Parameters =
CatKWs = an array/hash of lists of strings. The list contains the possible
keywords, for a given category.
usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
NbCats = integer number of pre-defined categories
Variables:
CatAccu = an array/hash of numeric values with one entry per each of the
possible categories. usage: CatAccu[3] = 4 (if array) or
CatAccu['CustTx'] += 1 (hash)
TotalKwOccurences = counts the total number of keywords matches (counts
multiple when a word is found in several pre-defined categories)
Pseudo code: (for categorizing one input item)
1. for x in 1 to NbCats
CatAccu[x] = 0 // reset the accumulators
2. for each word W in Item
for each x in 1 to NbCats
if W found in CatKWs[x]
TotalKwOccurences++
CatAccu[x]++
3. for each x in 1 to NbCats
CatAccu[x] = CatAccu[x] / TotalKwOccurences // calculate rating
4. Sort CatAccu by value
5. Return the ordered list of (CategoryID, rating)
for all corresponding CatAccu[x] values about a given threshold.
Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. note that this division does not affect the relative ranking of a category selection for a given item, but it may be significant when comparing rating of different items.
Now, several simple improvements come to mind: (I'd seriously consider the first two, and give thoughts to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized and other factors...)
We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not loose a significant one because of a silly typo.
We should give more importance to words found less frequently in CatKWs. For example the word 'Account' should could less than the word 'foo' or 'credit'
We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
We could also include consideration based on digrams (two consecutive words), for with natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator that the words themselves.
we could add a tiny bit of importance to the category assigned to the preceding (or even following, in a look-ahead logic) item. Item will likely come in related series and we can benefit from this regularity.
Also, aside from the calculation of the rating per-se, we should also consider:
some metrics that would be used to rate the algorithm outcome itself (tbd)
some logic to collect the list of words associated with an assigned category and to eventually run statistic on these. This may allow the identification of words representative of a category and not initially listed in CatKWs.
The question of metrics, should be considered early, but this would also require a reference set of input item: a "training set" of sort, even though we are working off a pre-defined dictionary category-keywords (typically training sets are used to determine this very list of category-keywords, along with a weight factor). Of course such reference/training set should be both statistically significant and statistically representative [of the whole set].
To summarize: stick to simple approaches, anyway the context doesn't leave room to be very fancy. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set at the detriment of the other items (overfitting).
I was also facing the same issue of creating a classifier based only on keywords. I was having a class keywords mapper file and which contained class variable and list of keywords occurring in a particular class. I came with the following algorithm to do and it is working really fine.
# predictor algorithm
for docs in readContent:
for x in range(len(docKywrdmppr)):
catAccum[x]=0
for i in range(len(docKywrdmppr)):
for word in removeStopWords(docs):
if word.casefold() in removeStopWords(docKywrdmppr['Keywords'][i].casefold()):
print(word)
catAccum[i]=catAccum[i]+counter
print(catAccum)
ind=catAccum.index(max(catAccum))
print(ind)
predictedDoc.append(docKywrdmppr['Document Type'][ind])