In studying for my exam I came across this question.
A website streams movies to customers’ TVs or other devices. Movies are in one of several genres such as action, drama, mystery, etc. Every movie is in exactly one genre (so that if a movie is an action movie as well as a comedy, it is in a genre called “action-Comedy”). The site has around 10 million customers, and around 25,000 movies, but both are growing rapidly. The site wants to keep track of the most popular movies streamed. You have been hired as the lead engineer to develop a tracking program.
i) Every time a movie is streamed to a customer, its name (e.g. “Harold and Kumar: Escape from Guantanamo Bay”) and genre (“Comedy”) is sent to your program so it can update the data structures it maintains.
(Assume your program can get the current year with a call to an appropriate Java class, in O(1) time.)
ii) Also, every once in a while, customers want to know what were the top k most streamed movies in genre g in year y. (If y is the current year, then accounting is done up to the current date.) For example, what were the top 10 most streamed comedy movies in 2010? Here k = 10, g = “comedy” and y = 2010. This query is sent to your program, which should output the top k movie names.
Describe the data structures and algorithms used to implement both requirements. For (i), analyze the big O running time to update the data structures, and for (ii) the big O running time to output the top k streamed movies.
My thought process was to create a hash table, with every new movie added to its respective genre in the hash table in a linked list. As for the second part, my only idea is to keep the linked list sorted but that seems way too expensive. What is a better alternative?
I use a heap to keep track of the top k objects of a class (k fixed). You can find the details of this data structure in any CS text, but basically it's a binary tree in which every node is smaller than either of its children. The main operation, which we will call reheap(node) assumes that both the children of node are heaps, compares node with the smaller of its two children, does the swap if necessary, and recursively calls reheap for the modified child. The class needs to have an overloaded operator< or the equivalent defined to do this.
At any point in time, the heap holds the top k objects, with the smallest of these at the top of the heap. When a new object arrives which is bigger than the top of the heap, it replaces that object on the heap, and then reheap is called. This can also happen at a node other than the top node if an object already on the heap becomes bigger than its smaller child. Another type of update occurs if an object already on the heap becomes smaller than its parent (this probably won't happen in the case you describe). Here it gets swapped with its parent and we then compare recursively against the grandparent, etc.
All of these updates have complexity O(log(k)). If you need to output the heap sorted from the top down, the same structure works well in time O(k log(k)). (This process is known as heapsort.)
Since swapping objects can be expensive, I usually keep the objects in a fixed array somewhere, and implement the heap as an array, A, of pointers, where the children of A[i] are A[2i+1] and A[2i+2].
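As a rough illustration of the size-k min-heap idea above, here is a minimal Python sketch. It is not the answerer's code: it uses the standard `heapq` module instead of a hand-written reheap, and the function name `top_k` and the `counts` dictionary (title to stream count) are assumptions made for the example.

```python
import heapq

def top_k(counts, k):
    """Return the k titles with the highest counts, best first.

    counts: dict mapping title -> number of streams (hypothetical input).
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (count, title); never holds more than k entries
    for title, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, title))
        elif (count, title) > heap[0]:
            # New entry beats the smallest of the current top k: replace it.
            heapq.heapreplace(heap, (count, title))
    # Sorting the k survivors costs O(k log k).
    return [title for count, title in sorted(heap, reverse=True)]
```

Each push or replace costs O(log k), so scanning n titles costs O(n log k). Note that this sketch rebuilds the heap per query rather than keeping it updated as counts change.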
You could do this in O(1) using one hash table "HT1" to map from (genre, year, movie_title) to an iterator into a linked list of (num_times_streamed, hash table of movie titles). You use the iterator to see if the next element in the list is for one greater streaming count and if so insert your movie title there and remove it from the other table (which if empty can be removed from the list), otherwise if the existing hash table has no other titles then increment the num_times_streamed, otherwise insert a new hash table in the list and add your title. Update the record of the iterator in HT1 as necessary.
Note that as described above the operations in the list use the end-points or an existing iterator to step through by no more than one position as the num_times_streamed value is incremented, so O(1).
To get the top k titles you'll need a hash table HT2 from { genre, year } to each of the linked lists: simply iterate from the end of the list and you'll encounter a hash table with a movie or movies with the highest streaming count, then the next highest and so on. If the year's just changed, you may not find k entries, handle that however you like. If when looking up a movie title it's found not to exist in HT1, you'd add a new list for that genre and the current year to HT2.
More visually, using { } around hash tables (whether mappings or sets), [ ] around linked lists, and ( ) around grouped struct/tuple data:
HT2 = { "comedy 2015": [ (1, { "title1", "title2" }),
                         (2, { "title3" }),  <--------\
                         (4, { "title4" }) ],         |
        "drama 2012":  [ (1, { "title5" }),           |
                         (3, { "title6" }) ],         |
        ...                                           |      .
      };                                              |      .
                                                      |      .
HT1 = { "title3", ------------------------------------/      |
        "title2", -------------------------------------------/
        ...
      };
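The structure described above can be sketched in Python roughly as follows. This is only an approximation under stated assumptions: the class and method names (`Bucket`, `StreamTracker`, `record`, `top_k`) are invented for the example, the year is passed in explicitly rather than read from a clock, and a title is always moved to the next bucket instead of incrementing a single-title bucket in place.

```python
class Bucket:
    """One node of the per-(genre, year) list: a count and the titles having it."""
    def __init__(self, count):
        self.count = count
        self.titles = set()
        self.prev = None
        self.next = None

class StreamTracker:
    def __init__(self):
        self.bucket_of = {}  # "HT1": (genre, year, title) -> Bucket
        self.lists = {}      # "HT2": (genre, year) -> (head, tail) sentinels

    def _list(self, genre, year):
        key = (genre, year)
        if key not in self.lists:
            head, tail = Bucket(0), Bucket(float("inf"))
            head.next, tail.prev = tail, head
            self.lists[key] = (head, tail)
        return self.lists[key]

    def record(self, title, genre, year):
        """Count one stream; only neighbouring buckets are touched, so O(1)."""
        head, _ = self._list(genre, year)
        key = (genre, year, title)
        cur = self.bucket_of.get(key, head)  # head sentinel stands for count 0
        target = cur.next
        if target.count != cur.count + 1:
            # No bucket for count + 1 yet: splice one in right after cur.
            new = Bucket(cur.count + 1)
            new.prev, new.next = cur, target
            cur.next = new
            target.prev = new
            target = new
        target.titles.add(title)
        self.bucket_of[key] = target
        if cur is not head:
            cur.titles.discard(title)
            if not cur.titles:  # drop the bucket once it is empty
                cur.prev.next = cur.next
                cur.next.prev = cur.prev

    def top_k(self, genre, year, k):
        """Walk from the high-count end of the list; O(k)."""
        if (genre, year) not in self.lists:
            return []
        head, tail = self.lists[(genre, year)]
        out = []
        node = tail.prev
        while node is not head and len(out) < k:
            for title in node.titles:
                if len(out) == k:
                    break
                out.append(title)
            node = node.prev
        return out
```

Titles that share a count come back in arbitrary order, because each bucket is an unordered set.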
I have a two dimensional array in Java:
private static final String[][] namestr = {
    { "John", "Mark", "David" },
    { "Peter", "Ken", "Mary" },
    { "Fisher", "Alice", "Chris" },
    { "Tod", "Jen", "Joe", "Morton" }
};
How can I write this Two dimensional array in Apex?
I still need to keep it as a two dimensional array with [4][3],
Thanks
You can have a list of lists. I think the proper term is "jagged array" as opposed to "multidimensional array". What I mean is that it's up to you to make sure each internal list has same size; language itself won't enforce this in any way. But since your last row has 4 items instead of 3 looks like you're OK with that.
http://www.salesforce.com/us/developer/docs/apexcode/Content/langCon_apex_collections_lists.htm
Because lists can contain any collection, they can be nested within
one another and become multidimensional. For example, you can have a
list of lists of sets of Integers. A list can contain up to four
levels of nested collections inside it.
List<List<String>> names = new List<List<String>>{
new List<String>{'John', 'Mark', 'David'},
new List<String>{'Peter', 'Ken', 'Mary'},
new List<String>{'Fisher', 'Alice', 'Chris'},
new List<String>{'Tod', 'Jen', 'Joe', 'Morton'}
};
System.debug(names[1]);
System.debug(names[1][2]);
There's no size limit; standard stuff like the heap space limit applies. If you plan to pass them to Visualforce, the limit is 1000 items though (10K if your page has the readonly attribute set).
Any guidance or pointing me to an example would be much appreciated (I can't formulate a good search term on the Googleplex).
I have a model using enums that I define in a dictionary and then render on the view with @Html.RadioButtonFor, etc.
Here is an example of my model:
public PaymentPlanList PaymentPlan { get; set; }
public enum PaymentPlanList
{
PaymentPlan_One,
PaymentPlan_Two,
}
public class PaymentPlanDictionary
{
public static readonly Dictionary<PaymentPlanList, string> paymentplanDictionary = new Dictionary<PaymentPlanList, string>
{
{ PaymentPlanList.PaymentPlan_One, "One full payment in advance (receive the lowest price)." },
{ PaymentPlanList.PaymentPlan_Two, "Two payments: first payment of 50% due up front, the balance of 50% due within 30 days (increases fee by $100)." },
};
static string ConvertPaymentPlan(PaymentPlanList paymentplanlist)
{
string name;
return (paymentplanDictionary.TryGetValue(paymentplanlist, out name))
? name : paymentplanlist.ToString();
}
static void Main()
{
Console.WriteLine(ConvertPaymentPlan(PaymentPlanList.PaymentPlan_One));
Console.WriteLine(ConvertPaymentPlan(PaymentPlanList.PaymentPlan_Two));
}
}
And, for completeness, this is my view related to the above:
<p>
@Html.RadioButtonFor(m => m.PaymentPlan, "PaymentPlan_One")
One full payment in advance (receive the lowest price).
</p>
<p>
@Html.RadioButtonFor(m => m.PaymentPlan, "PaymentPlan_Two")
Two payments: first payment 50% due up front, the balance of 50% due within 30 days (increases fee by $100).
</p>
This is a quote system I have users fill out. For this particular service, say I charge $1,000.00. This is the base price. Based on user input, this price will be changed, and I want to show that to the user. So, if the user selects the first option, the price remains unchanged. If the user selects the second option, the fee is increased by $100.00.
This changes exponentially, since there are more inputs that affect the price (if selected).
Ultimately, based on the user inputs, I need to calculate the total. I am rendering a view which will display the total. I was thinking of using some @{ } blocks and if/else if statements to either a) show nothing if what was selected does not increase the total, or b) show the additional amount (e.g., $100.00), and then later show a total.
Something like (EDITING here for clarity):
Base service: $1,000.00
Addon service1: $100.00 (only if user selects "PaymentPlan_Two" for two payments of 50% each (from the PaymentPlanList enum), otherwise hidden (and no addition of the $100.00) if user selects "PaymentPlan_One" and pays in full)
Addon service2: $0.00 (this is hidden and a $0.00 or no value since the user did not select anything from a separate enum, but the value of $100.00 would be added if selected, which would make the Total $1,200.00 if it were selected; ALTERNATIVELY, how could I handle if there were 3 or more items in the list? E.g., Choice_One is $0.00, Choice_Two is $100.00 and Choice_Three is $200.00)
TOTAL: $1,100.00
Thanks for any help.
Let's see if I understand your requirements correctly:
The application needs to add prices to a base price, depending on the selection of Addon services.
These selections are from a Dictionary, which is based on an Enum.
Therefore we're looking to store the price against the Enum to keep the data associations in one place.
It is possible to store a single value against an Enum:
public enum PaymentPlanList
{
PaymentPlan_One = 100,
PaymentPlan_Two = 200,
}
However, I don't think this would be flexible enough for our needs. Enum values must be integral, and explicit values are most commonly assigned for bitwise flag operations (where the values are powers of 2).
I think a better solution here might be to use a Model-View-ViewModel (MVVM) pattern, where the view model contains the logic about which services are available, how much they cost, and which services are valid in combination with other services.
There's a ticket pricing example (which sounds similar in concept to the domain here) on the Knockout.js home page that recalculates a travel ticket price on the client web page based on a user selection.
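Whichever layer ends up owning the logic, the calculation itself amounts to a lookup from each selected option to its surcharge. The Python sketch below only illustrates that idea and is not the C# or Knockout.js code: the names `PaymentPlan`, `SURCHARGE` and `quote` are made up for the example, and the amounts are the ones quoted in the question.

```python
from decimal import Decimal
from enum import Enum

class PaymentPlan(Enum):
    ONE = "PaymentPlan_One"
    TWO = "PaymentPlan_Two"

# Surcharge per option, kept next to the enum instead of inside its values.
SURCHARGE = {
    PaymentPlan.ONE: Decimal("0.00"),
    PaymentPlan.TWO: Decimal("100.00"),
}

def quote(base, selections):
    """Return (visible add-on lines, total) for the selected options."""
    # Only options that actually change the price become visible lines.
    lines = [(s.value, SURCHARGE[s]) for s in selections if SURCHARGE[s] > 0]
    total = base + sum(amount for _, amount in lines)
    return lines, total
```

With a base of `Decimal("1000.00")`, selecting `PaymentPlan.TWO` gives one add-on line and a total of 1100.00, while `PaymentPlan.ONE` gives no add-on lines and leaves the total unchanged. A second enum with three or more choices could be handled by adding its members to the same kind of lookup.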
Say there's a list. Each item in the list has a unique id.
List [5, 2, 4, 3, 1]
When I remove an item from this list, the unique id from the item goes with it.
List [5, 2, 3, 1]
Now say I want to add another item to the list, and give it the least lowest unique id.
What's the easiest way to get the lowest unique id when adding a new item to the list?
Here's the restriction though: I'd prefer it if I didn't reassign the unique id of another item when deleting an item.
I realise it would be easy to find the unique id if I reassigned unique id 5 to unique id 4 when I deleted 4. Then I could get the length of the list and create the new item with the next number as its unique id.
So is there another way, that doesn't involve iterating through the entire list?
EDIT:
Language is java, but I suppose I'm looking for a generic algorithm.
An easy fast way is to just put your deleted ids in a priority queue, and just pick the next id from there when you insert new ones (or use size() + 1 of the first list as id when the queue is empty). This would however require another list.
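A minimal Python sketch of this priority-queue idea follows, assuming ids start at 1 as in the question. The class name `IdAllocator` and its methods are invented for the example, and a counter of never-used ids stands in for "size() + 1".

```python
import heapq

class IdAllocator:
    def __init__(self):
        self._free = []   # min-heap of ids that were released
        self._next = 1    # smallest id that has never been handed out

    def acquire(self):
        """Return the lowest available id in O(log n)."""
        if self._free:
            return heapq.heappop(self._free)
        new_id = self._next
        self._next += 1
        return new_id

    def release(self, item_id):
        """Make an id available again; assumes it is currently in use."""
        heapq.heappush(self._free, item_id)
```

Every released id is smaller than the next never-used id, so popping the heap first always yields the lowest free id without scanning the list.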
You could maintain a list of available ID's.
Declare a boolean array (pseudo code):
boolean register[3];
register[0] = false;
register[1] = false;
register[2] = false;
When you add an element, loop from the bottom of the register until a false value is found. Set the false value to true, assign that index as the unique identifier.
removeObject(index)
{
register[index] = false;
}
getsetLowestIndex()
{
for(i=0; i<register.size;i++)
{
if(register[i]==false)
{
register[i] = true;
return i;
}
}
// Array is full: grow the register by one slot and claim the new last index
register.size = register.size + 1;
register[register.size - 1] = true;
return register.size - 1;
}
When you remove an element, simply set the index to false.
You can optimise this for larger lists by having continuality markers so you don't need to loop the entire thing.
This would work best for your example where the indexes are in no particular order, so you skip the need to sort them first.
It's equivalent to a search, just this time you search for a missing number. If your IDs are sorted integers, you can go from bottom to top, checking whether the gap between two adjacent IDs is greater than 1.
If you know how many items are in the list and it's sorted, you can implement a binary search.
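The binary search can be sketched as below, assuming the ids are unique integers kept in ascending order and that numbering starts at `first` (1 in the question's example). The function name `lowest_free_id` is made up for the illustration.

```python
def lowest_free_id(sorted_ids, first=1):
    """Return the smallest id >= first that is missing from sorted_ids.

    With unique sorted ids and no gap before position i,
    sorted_ids[i] == first + i. Search for the first position
    where that stops being true.
    """
    lo, hi = 0, len(sorted_ids)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_ids[mid] == first + mid:
            lo = mid + 1   # no gap up to mid, look to the right
        else:
            hi = mid       # a gap exists at or before mid
    return first + lo
```

This runs in O(log n), but it depends on the list staying sorted, which the question's example list is not.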
I don't think you can do this without iterating through the list.
When you say
'Now say I want to add another item to the list, and give it the least highest unique id.'
I assume you mean you want to assign the lowest available ID that has not been used elsewhere.
You can do this:
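private int GetLowestFreeID(List<int> list){
    // Assumes the list is sorted ascending and that IDs start at 0.
    for (int idx = 0; idx < list.Count; ++idx){
        if (list[idx] != idx) return idx; // first gap found
    }
    return list.Count; // no gaps: the next free ID is the list length
}
This returns the lowest free ID.
It assumes your list is sorted and that IDs start at 0. It is in C#, but you get the idea.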
The data structure that would be used to do this is a binary-heap priority queue that only allows unique values.
How about keeping the list sorted? Then you can remove it from one end easily.
Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();
function check($str1, $str2){
$minprozent = 60;
similar_text($str1, $str2, $prozent);
$prozent = sprintf("%01.2f", $prozent);
if ($prozent > $minprozent) {
return TRUE;
}
else {
return FALSE;
}
}
$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
$text = mysql_real_escape_string($sql3['text']); // escape the article text before embedding it in SQL
$rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$text."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$text."') AND id != ".intval($sql3['id'])." LIMIT 0, 20";
$rel2 = mysql_query($rel1);
$rel2a = mysql_num_rows($rel2);
if ($rel2a > 0) {
while ($rel3 = mysql_fetch_assoc($rel2)) {
if (check($sql3['text'], $rel3['text']) == TRUE) {
$id_a = $sql3['id'];
$id_b = $rel3['id'];
$rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
$rein2 = mysql_query($rein1);
$rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
$rein4 = mysql_query($rein3);
}
}
}
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
$square = pow($number, 2);
return $square;
}
function check($text1, $text2) {
$words_sub = text_splitter($text2); // splits the text into single words
$words = text_splitter($text1); // splits the text into single words
// document 1 start
$document1 = array();
foreach ($words as $word) {
if (in_array($word, $words)) {
if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
}
}
$rating1 = 0;
foreach ($document1 as $temp) {
$rating1 = $rating1+square($temp);
}
$rating1 = sqrt($rating1);
// document 1 end
// document 2 start
$document2 = array();
foreach ($words_sub as $word_sub) {
if (in_array($word_sub, $words)) {
if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
}
}
$rating2 = 0;
foreach ($document2 as $temp) {
$rating2 = $rating2+square($temp);
}
$rating2 = sqrt($rating2);
// document 2 end
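$skalarprodukt = 0;
// multiply the counts of matching words; array_shift() would pair up unrelated words
foreach ($document1 as $word => $count) {
if (isset($document2[$word])) { $skalarprodukt = $skalarprodukt+($count*$document2[$word]); }
}
if (($rating1*$rating2) == 0) { return FALSE; } // "continue" is not valid outside a loop
$kosinusmass = $skalarprodukt/($rating1*$rating2);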
if ($kosinusmass < 0.7) {
return FALSE;
}
else {
return TRUE;
}
}
?>
I would also like to say that I know that there are lots of algorithms for clustering but on every site there is only the mathematical description which is a bit difficult to understand for me. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like you have, is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say that across all your articles you only have 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:
for word in article.text:
    article.histogram[indexLookup[word]] += 1
Now, to compare any two articles, it is pretty straightforward. We simply take the dot product of the two vectors:
def check(articleA, articleB):
    rtn = 0
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a * b
    return rtn > threshold
(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
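One common way to do that normalization is cosine similarity: divide the dot product by the lengths of the two histogram vectors, so that long articles do not score higher just for being long. The sketch below is an assumption-laden illustration, not part of the original answer: it splits on whitespace only, and the function name `cosine_similarity` is made up for the example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0
```

The result lies between 0 (no shared words) and 1 (same word proportions), so a fixed threshold means the same thing for short and long articles.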
Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.
By the way, this technique is described in more detail here
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a Lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K*number of documents per iteration. I believe there are also heuristics for reducing the needed number of computations for hierarchical clustering as well.
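The K-means loop described above can be sketched in plain Python as follows. This is a toy version under stated assumptions: documents are already turned into equal-length numeric vectors, the first k documents serve as the initial centers, distance is squared Euclidean, and the function name `kmeans` and the fixed iteration count are choices made for the example.

```python
def kmeans(vectors, k, iterations=10):
    """Return a list giving the cluster index (0..k-1) of each vector."""
    centers = [list(v) for v in vectors[:k]]  # initial centers: first k documents
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: K distance computations per document.
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

Each iteration compares every document with K centers only, instead of with every other document.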
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on the White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but seems too arbitrary to catch all useful clustering of texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform a clustering to identify nearest neighbors for each article, which you could load into the related table. I would do the clustering in R - there's a nice tutorial for clustering data in a file running R from php.