We are implementing the extended Boolean model, but we cannot figure out how to use the formula given at http://en.wikipedia.org/wiki/Extended_Boolean_model. The formula contains three "variables", but we have no clue what they mean. Assume we have already processed the collection of documents, so we have mapped all words in the collection, and for each term we have the count of occurrences in each document as well as the count of occurrences (of that term) in the whole collection.
It says right there: "The weight of term Kx associated with document dj".
So we are talking about term 'x' and document 'j'. 'i' is the index that maximizes Idf_i, i.e. the term with the largest inverse document frequency (the one that occurs in the fewest documents); it is used to normalize the weight.
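If I read the linked article correctly, the formula is w(x, j) = f(x, j) * Idf(x) / max_i Idf(i). Here is a rough Python sketch using the counts you describe; all names are mine, Idf is computed from the number of documents containing a term (not the total occurrence count), and I am assuming f(x, j) is the term's count in d_j normalized by the count of the document's most frequent term:

import math

def idf(term, num_docs, docs_containing):
    # docs_containing: term -> number of documents that contain the term
    return math.log(num_docs / docs_containing[term])

def weight(term, doc_term_counts, num_docs, docs_containing):
    # doc_term_counts: term -> raw count of the term in document d_j
    # normalized term frequency: divide by the most frequent term's count (assumption)
    f = doc_term_counts.get(term, 0) / max(doc_term_counts.values())
    max_idf = max(idf(t, num_docs, docs_containing) for t in docs_containing)
    return f * idf(term, num_docs, docs_containing) / max_idf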
This question was posted on Leetcode here.
Design a data structure that can return the top trending keyword. The time complexity should be as low as possible; to be honest, I don't know yet how optimal the solution can be.
Given 2 parameters as input:
Parameter 1: String Username
Parameter 2: Array of String containing the keywords tweeted by the user.
Function declaration: function tweet(username, keywords[]){};
Example 1:
tweet("User1",["love","dog"])
tweet("User2",["cat"])
tweet("User3",["walk","cat"])
tweet("User2",["dog"])
tweet("User3",["like","dog"])
top trending keyword: Dog
Example 2:
a. tweet("User1",["Dog"])
b. tweet("User1",["like","Dog"])
c. tweet("User1",["love","Dog"])
d. tweet("User1",["walk","Dog"])
e. tweet("User1",["hate","Dog"])
f. tweet("User2",["like","cat"])
g. tweet("User3",["cat"])
top trending keyword: cat
explanation: Only consider the number of unique users who tweeted a particular keyword while calculating the top trending keyword.
For this question, I was able to come up with a solution using the following (similar to the one posted on Leetcode here):
1. Map, which holds the set of words for a given user,
2. Map which holds the word and its unique user count.
3. Max Heap -> used to retrieve the top word based on the frequency.
However, for every word that is already in Map 2, if I add it to the PQ, I need to do a remove operation, which is O(n), and then add it again with the increased frequency.
E.g. in example 2 above, up to operation e
After a, Map1- [<User1,[Dog]>], Map2- [<Dog,1>], PQ- [1-Dog]
After b, Map1- [<User1,[Dog,like]>], Map2- [<Dog,1>,<like,1>], PQ- [1-Dog,1-like]
...
After e,
Map1- [<User1,[Dog,like,love,walk,hate]>],
Map2- [<Dog,1>,<like,1>,<love,1>,<walk,1>,<hate,1>],
**PQ-[1-Dog,1-like,1-love,1-walk,1-hate]**
After f,
Map1- [<User1,[Dog,like,love,walk,hate]>,<User2,[like,cat]>],
Map2- [<Dog,1>,<like,2>,<love,1>,<walk,1>,<hate,1>,<cat,1>],
**PQ- [2-like,1-Dog,1-love,1-walk,1-hate]**
My question is: after adding the entry "User2 - like, cat" in step f above, I need to re-balance the max heap, i.e. remove "like" and add it back so that it is now at the top of the heap.
Is this the optimal way, or can I optimize it further so that I don't incur the cost of remove() or re-balancing?
I tried with a TreeMap too, but could not figure out the right combination of data structures.
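For reference, here is a minimal Python sketch of the approach described above; all names are mine, and the O(n) remove is exactly the step I would like to avoid:

import heapq

user_words = {}        # Map 1: username -> set of words the user has tweeted
word_user_count = {}   # Map 2: word -> number of unique users who tweeted it
heap = []              # max heap of (-count, word)

def tweet(username, keywords):
    seen = user_words.setdefault(username, set())
    for word in keywords:
        if word in seen:
            continue                      # not a new (user, word) pair
        seen.add(word)
        old = word_user_count.get(word, 0)
        word_user_count[word] = old + 1
        if old:                           # stale heap entry must be removed: O(n)
            heap.remove((-old, word))
            heapq.heapify(heap)
        heapq.heappush(heap, (-(old + 1), word))

def top_trending():
    return heap[0][1] if heap else None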
Consider that there are 10 billion words that people have searched for on Google. Corresponding to each word you have the sorted list of all document IDs. The lists look like this:
[Word 1]->[doc_i1,doc_j1,.....]
[Word 2]->[doc_i2,doc_j2,.....]
...
...
...
[Word N]->[doc_in,doc_jn,.....]
I am looking for an algorithm to find 100 rare word-pairs.
A rare word-pair is a pair of words that occur together (not necessarily contiguously) in exactly one document.
I am looking for something better than O(n^2) if possible.
You order the words according to the number of documents they occur in. The idea here is that words that occur rarely at all will occur rarely in pairs as well. If you find words that occur in exactly one document, just pick any other word from that document and you are done.
Then you start inverting the index, starting with the rarest word. That means you create a map where each document points to the set of words in it. At first you create that inverted index with the rarest word only. After you have inserted all documents associated with that rarest word into the inverted index, you have a map where each document points to exactly one word.
Then you add the next word with all its documents, still following the order created in (1.). At some point you will find that a document associated with a word is already present in your inverted map. There you check whether the words already associated with that document form such a rare word pair with the new word.
The performance of this depends heavily on how far you have to go to find 100 such pairs; the idea is that you are done after processing only a small fraction of the total data set. To take advantage of the fact that you only process a small fraction of the data, you should employ in (1.) a sort algorithm that lets you find the smallest elements long before the entire set has been sorted, like a partially run quicksort. The sorting can then be done in roughly O(N*log(N1)), where N1 is the number of words you actually need to add to the inverted index before finding 100 pairs. The complexity of the other operations, namely adding a word to the inverted index and checking whether a word pair occurs in more than one document, is also linear in the number of documents per word, so those operations are fast at the beginning and slow down later, because later words have more documents each.
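A rough Python sketch of this idea; names are mine, it does a full sort instead of the partial sort suggested above, and it confirms each candidate pair by intersecting the two posting lists:

from collections import defaultdict

def rare_pairs(postings, needed=100):
    # postings: word -> sorted list of doc ids, as in the question
    # (1.) order the words rarest-first by document count
    words = sorted(postings, key=lambda w: len(postings[w]))

    doc_to_words = defaultdict(list)      # the incrementally built inverted index
    found = []

    for w in words:                       # (2.)/(3.) add words rarest-first
        for doc in postings[w]:
            for other in doc_to_words[doc]:
                # candidate pair: check that it co-occurs in exactly one document
                common = set(postings[w]) & set(postings[other])
                if len(common) == 1:
                    found.append((other, w))
                    if len(found) == needed:
                        return found
            doc_to_words[doc].append(w)
    return found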
This is the opposite of "Frequent Itemset Mining".
See, for example, this recent publication: Rare pattern mining: challenges and future perspectives
Google states that a "term-vector algorithm" can be used to determine popular keywords. I have studied http://en.wikipedia.org/wiki/Vector_space_model, but I can't understand the term "term-vector algorithm".
Please explain it in a brief summary, very simple language, as if the reader is a child.
I believe "vector" refers to the mathematics definition, a quantity having direction as well as magnitude. How is it that keywords have a quantity moving in a direction?
http://en.wikipedia.org/wiki/Vector_space_model states "Each dimension corresponds to a separate term." I thought dimension relates to cardinality; is that correct?
From the book Hadoop In Practice, by Alex Holmes, page 12.
It means that each word forms a separate dimension:
Example: (shamelessly taken from here)
For a model containing only three words you would get:
dict = { dog, cat, lion }
Document 1
“cat cat” → (0,2,0)
Document 2
“cat cat cat” → (0,3,0)
Document 3
“lion cat” → (0,1,1)
Document 4
“cat lion” → (0,1,1)
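A tiny Python sketch of the same mapping, with the dictionary order (dog, cat, lion) taken from the example above:

from collections import Counter

dictionary = ["dog", "cat", "lion"]

def to_vector(text):
    counts = Counter(text.split())
    return tuple(counts[word] for word in dictionary)

print(to_vector("cat cat"))    # (0, 2, 0)
print(to_vector("lion cat"))   # (0, 1, 1)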
The most popular example for MapReduce is calculating word frequency: a map step that outputs each word as a key with 1 as the value, and a reduce step that sums the numbers for each word. So if a web page has a list of (possibly duplicate) words, each word in that list maps to 1. The reduce step essentially counts how many times each word occurs on that page. You can do this across pages, websites, or whatever criteria. The resulting data is a dictionary mapping each word to its frequency, which is effectively a term frequency vector.
Example document: "a be see be a"
Resulting data: { 'a':2, 'be':2, 'see':1 }
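As a toy Python sketch of that map/reduce (function names are mine):

from collections import defaultdict

def map_step(document):
    # emit (word, 1) for every word occurrence
    return [(word, 1) for word in document.split()]

def reduce_step(pairs):
    # sum the 1s per word
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(reduce_step(map_step("a be see be a")))   # {'a': 2, 'be': 2, 'see': 1}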
Term vector sounds like it just means that each term has a weight or numeric value attached, probably corresponding to the number of times the term is mentioned.
You are thinking of the geometric meaning of the word vector, but there is another mathematical meaning that simply refers to multiple dimensions: instead of saying x, y, z you say the vector x (in bold), which has multiple dimensions x1, x2, x3, ..., xn, each with some value. So for a term vector, the vector is made up of terms and takes the form term1, term2, up to term n. Each can then have a value, just as x, y, or z has a value.
As an example, term 1 could be dog, term 2 cat, and term 3 lion, each with a weight, say 2, 3, 1, meaning the word dog appears twice, cat three times, and lion once.
I happen to be building a binary search in Python, but the question has more to do with the structure of binary search in general.
Let's assume I have about one thousand eligible candidates that I am searching through using binary search, doing the classic approach of bisecting the sorted dataset and repeating the process to narrow down the eligible set. The candidates are just strings of names (first-last format, e.g. "Peter Jackson"). I initially sort the set alphabetically and then proceed with bisection using something like this:
def find(names, query):
    # classic binary search over an alphabetically sorted list of names
    target = query.lower()
    lo, hi = 0, len(names)
    while lo < hi:
        mid = (lo + hi) // 2
        midval = names[mid].lower()
        if midval < target:
            lo = mid + 1
        elif midval > target:
            hi = mid
        else:
            return midval
    return None
This code adapted from here: https://stackoverflow.com/a/212413/215608
Here's the thing: the above procedure assumes a single exact match or no result at all. What if the query were merely for "Peter", but there are several Peters with differing last names? In order to return all the Peters, one would have to ensure that the bisected "bins" never got so small as to exclude eligible results. The bisection process would have to cease and cede to something like a regex or plain string match in order to return all the Peters.
I'm not so much asking how to accomplish this as what this type of search is called: what is a binary search with a defined criterion for "bin size" called? Something that conditionally bisects the dataset and, once the criterion is fulfilled, falls back to some other form of string matching, so that there can effectively be a trailing wildcard on the query (a search for "Peter" will get the Peter Jacksons and the Peter Edwards).
Hopefully I've been clear about what I mean. I realize that in the typical DB scenario the names might be stored in separate fields; this is just intended as a proof of concept.
I've not come across this type of two-stage search before, so don't know whether it has a well-known name. I can, however, propose a method for how it can be carried out.
Let's say you've run the first stage and have found no exact match.
You can perform the second stage with a pair of binary searches and a special comparator. The binary searches would use the same principle as bisect_left and bisect_right. You won't be able to use those functions directly since you'll need a special comparator, but you can use them as the basis for your implementation.
Now to the comparator. When comparing the list element x against the search key k, the comparator would only use x[:len(k)] and ignore the rest of x. Thus when searching for "Peter", all Peters in the list would compare equal to the key. Consequently, bisect_left() to bisect_right() would give you the range containing all Peters in the list.
All of this can be done using O(log n) comparisons.
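As an illustration, here is a Python sketch of such a pair of binary searches; the prefix slice plays the role of the special comparator described above, and the names and sample list are mine:

def prefix_left(names, key):
    # like bisect_left, but compares only the first len(key) characters
    lo, hi = 0, len(names)
    while lo < hi:
        mid = (lo + hi) // 2
        if names[mid][:len(key)].lower() < key.lower():
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_right(names, key):
    # like bisect_right, with the same prefix comparison
    lo, hi = 0, len(names)
    while lo < hi:
        mid = (lo + hi) // 2
        if names[mid][:len(key)].lower() <= key.lower():
            lo = mid + 1
        else:
            hi = mid
    return lo

names = sorted(["Peter Jackson", "Peter Edwards", "Alice Smith", "Zoe Park"], key=str.lower)
lo, hi = prefix_left(names, "Peter"), prefix_right(names, "Peter")
print(names[lo:hi])   # ['Peter Edwards', 'Peter Jackson']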
In your binary search you either hit an exact match or an area where the match would be.
So in your case you need to get the upper and lower boundaries (hi and lo, as you call them) of the area that would include "Peter" and return all the strings in between.
But if you aim to do something like showing the next words for a given word, you should look into tries instead of BSTs.
Take the following string as an example:
"The quick brown fox"
Right now the q in quick is at index 4 of the string (starting at 0) and the f in fox is at index 16. Now let's say the user enters some more text into this string.
"The very quick dark brown fox"
Now the q is at index 9 and the f is at index 26.
What is the most efficient method of keeping track of the index of the original q in quick and f in fox no matter how many characters are added by the user?
Language doesn't matter to me; this is more of a theory question than anything, so use whatever language you want, but try to keep it to generally popular and current languages.
The sample string I gave is short, but I'm hoping for an approach that can efficiently handle a string of any size. So updating an array with the offsets would work for a short string, but will bog down with too many characters.
Even though in the example I was looking for the index of unique characters in the string, I also want to be able to track the index of the same character in different locations, such as the o in brown and the o in fox. So searching is out of the question.
I was hoping for the answer to be both time and memory efficient but if I had to choose just one I care more about performance speed.
Let's say that you have a string and some of its letters are interesting. To make things easier, let's say that the letter at index 0 is always interesting and you never add anything before it; it acts as a sentinel. Write down pairs of (interesting letter, distance to the previous interesting letter). If the string is "+the very Quick dark brown Fox" and you are interested in the q from 'Quick' and the f from 'Fox', then you would write: (+,0), (q,10), (f,17). (The sign + is the sentinel.)
Now you put these in a balanced binary tree whose in-order traversal gives the sequence of letters in the order they appear in the string. You might now recognize the partial sums problem: you enhance the tree so that each node contains (letter, distance, sum), where sum is the total of all distances stored in the node's left subtree.
You can now query and update this data structure in logarithmic time.
To record that you added n characters to the left of character c, you do distance(c) += n and then add n to sum(a) for every ancestor a of c that has c in its left subtree.
To ask what the index of c is, you compute distance(c) + sum(c), plus distance(a) + sum(a) for every ancestor a of c that has c in its right subtree.
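Here is a compact Python sketch of that structure; balancing and insertion of new interesting letters are omitted, and all names are mine:

class Node:
    def __init__(self, letter, distance):
        self.letter, self.distance = letter, distance
        self.sum = 0                      # total of distances in the left subtree
        self.left = self.right = self.parent = None

def index_of(node):
    # node's own distance plus everything to its left, walking up to the root
    idx = node.distance + node.sum
    child, anc = node, node.parent
    while anc is not None:
        if child is anc.right:            # everything under anc's left side precedes node
            idx += anc.distance + anc.sum
        child, anc = anc, anc.parent
    return idx

def shift(node, n):
    # n characters were inserted just before node's letter
    node.distance += n
    child, anc = node, node.parent
    while anc is not None:
        if child is anc.left:             # node sits in anc's left subtree
            anc.sum += n
        child, anc = anc, anc.parent

# Example tree for "+the very Quick dark brown Fox": in-order is +, q, f
root, sentinel, fox = Node('q', 10), Node('+', 0), Node('f', 17)
root.left, root.right = sentinel, fox
sentinel.parent = fox.parent = root
root.sum = sentinel.distance              # 0

print(index_of(root), index_of(fox))      # 10 27
shift(fox, 5)                             # insert 5 characters between Quick and Fox
print(index_of(fox))                      # 32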
Your question is a little ambiguous - are you looking to keep track of the first instances of every letter? If so, an array of length 26 might be the best option.
Whenever you insert text into a string at a position lower than the index you have, just compute the offset based on the length of the inserted string.
It would also help if you had a target language in mind as not all data structures and interactions are equally efficient and effective in all languages.
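A tiny Python sketch of that bookkeeping; names are mine, and it tracks only the first occurrence of each character:

first_index = {}                     # character -> index of its first occurrence

def record(text):
    for i, ch in enumerate(text):
        first_index.setdefault(ch, i)

def on_insert(position, inserted):
    # shift every tracked index at or after the insertion point
    for ch, idx in first_index.items():
        if idx >= position:
            first_index[ch] = idx + len(inserted)

record("The quick brown fox")
on_insert(4, "very ")                # "The very quick brown fox"
print(first_index['q'], first_index['f'])   # 9 21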
The standard trick that usually helps in similar situations is to keep the characters of the string as leaves in a balanced binary tree. Additionally, internal nodes of the tree should keep sets of letters (if the alphabet is small and fixed, they could be bitmaps) that occur in the subtree rooted at a particular node.
Inserting or deleting a letter in this structure only needs O(log(N)) operations (update the bitmaps on the path to the root), and finding the first occurrence of a letter also takes O(log(N)) operations: you descend from the root, going to the leftmost child whose bitmap contains the interesting letter.
Edit: The internal nodes should also keep the number of leaves in the represented subtree, for efficient computation of the letter's index.
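Here is a small Python sketch of that structure; balancing and insertion are omitted, and the names are mine:

class Leaf:
    def __init__(self, ch):
        self.letters, self.count = {ch}, 1

class Inner:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.letters = left.letters | right.letters   # letters occurring below
        self.count = left.count + right.count         # number of leaves below

def first_index(node, ch):
    # descend towards the leftmost subtree whose letter set contains ch
    if ch not in node.letters:
        return None
    if isinstance(node, Leaf):
        return 0
    if ch in node.left.letters:
        return first_index(node.left, ch)
    return node.left.count + first_index(node.right, ch)

def build(text):
    # naive bottom-up pairing; good enough for the sketch
    nodes = [Leaf(c) for c in text]
    while len(nodes) > 1:
        nodes = [Inner(nodes[i], nodes[i + 1]) if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return nodes[0]

tree = build("The quick brown fox")
print(first_index(tree, 'q'), first_index(tree, 'f'))   # 4 16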