Suppose we have 1000 items and a place to show any ten of them at a time to the visiting user. We can capture the click rate and which items are shown together.
How can we optimally get the most popular items (say 10) out of these?
How can we continually update popularity and show the optimal 10 items?
Edit: I'm looking for different approaches rather than implementations.
If you really want to squeeze this down, there is a dumb/simple approach for your case (show top 1%).
This optimization works because, on average, only about 1 out of 100 popularity changes will knock out one of the top 1%. (This assumes a random distribution of updates; with a more typical power-law distribution it could happen much more frequently.)
Sort the entire initial collection,
Store only the top 10 in any sorted data structure (e.g. BST)
Store the popularity score of #10 (e.g. minVisiblePopularity)
then with each subsequent popularity change in the collection, compare with minVisiblePopularity.
If the new popularity falls above the minVisiblePopularity, update the top-10 structure and minVisiblePopularity accordingly.
(Or if the old popularity was above the threshold but the new popularity falls below it - i.e. a former top-10 item getting knocked out.)
This adds a minimal storage requirement of an extremely small binary search tree (10 items) and a primitive variable. The tree then only requires updating when a popularity change knocks out one of the previous top-10.
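A minimal sketch of this threshold approach in Python (assuming integer item ids and a plain dict of popularity scores; top_k and on_popularity_change are made-up names for illustration):

import heapq

def top_k(popularity, k=10):
    """One full pass: the k most popular items as a min-heap of (score, item_id)."""
    heap = [(s, i) for i, s in heapq.nlargest(k, popularity.items(), key=lambda kv: kv[1])]
    heapq.heapify(heap)              # heap[0][0] plays the role of minVisiblePopularity
    return heap

def on_popularity_change(popularity, heap, item, new_score, k=10):
    """Cheap threshold check on every update; touch the top-k structure only rarely."""
    was_visible = any(i == item for _, i in heap)   # k is tiny, so this scan is cheap
    popularity[item] = new_score
    if was_visible or new_score > heap[0][0]:
        # Rare case: the visible set (or its threshold) may change, so recompute it.
        heap[:] = top_k(popularity, k)
    # Common case: nothing to do; the top k and minVisiblePopularity are unchanged.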
Self-implemented:
Maintain an array ordered by popularity, plus a hash table that maps each item to its corresponding node in the popularity binary tree.
The last 10 entries are then the most popular items; accessing them is O(M), where M is the number of items to show.
To maintain the ordered structure:
It can be kept in a self-balancing binary tree with O(log N) update complexity, where N is the total number of elements (see the sketch below).
http://www.sitepoint.com/data-structures-2/
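A rough Python sketch of that combination (a sorted list kept with bisect stands in for the balancing tree, and a dict plays the role of the hash table; the list shifts shown here are O(N), which the tree would bring down to O(log N)):

import bisect

class PopularityIndex:
    """An array ordered by popularity plus a map from each item to its score."""

    def __init__(self):
        self.ordered = []    # ascending list of (popularity, item)
        self.by_item = {}    # item -> current popularity

    def update(self, item, popularity):
        old = self.by_item.get(item)
        if old is not None:
            # Locate and remove the stale entry (O(log N) search, O(N) removal here).
            del self.ordered[bisect.bisect_left(self.ordered, (old, item))]
        self.by_item[item] = popularity
        bisect.insort(self.ordered, (popularity, item))

    def top(self, m=10):
        # The last m entries are the most popular: O(M) access.
        return [item for _, item in reversed(self.ordered[-m:])]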
As a practical option:
a database can be used to store the items, with a B-tree index on the popularity column; the DBMS will provide the required optimizations here.
https://en.wikipedia.org/wiki/Database_index
I am looking for the approach or algorithm that can help with the following requirements:
Partition the elements into a defined number of X partitions. Number of partitions might be redefined manually over time if needed.
Each partition should not have more than Y elements
Elements have a "category Id" and "element Id". Ideally all elements with the same category Id should be within the same partition. They should overflow to as few partitions as possible only if a given category has more than Y elements. Number of categories is orders of magnitude larger than number of partitions.
If an element has previously been assigned to a given partition, it should continue to be assigned to that partition
Account for change in the data. Existing elements might be removed and new elements within each of the categories can be added.
So far my naive approach is to:
sort the categories descending by their number of elements
keep a variable with a count-of-elements for a given partition
assign the rows from the first category to the first partition and increase the count-of-elements
if count-of-elements > Y: spill the overflowing elements to the next partition, but only if the category itself has more than Y elements; otherwise move all elements of that category to the next partition
continue till all elements are assigned to partitions
In order to persist the assignments, store all (element Id, partition Id) pairs in the database
On the consecutive re-assignments:
remove from the database any elements that were deleted
assign existing elements to the partitions based on (element Id, partition Id)
for any new elements follow the above algorithm
My main worry is that after a few such runs we will end up with categories spread all across the partitions, as the initial partitions will have been completely filled in. Perhaps adding a buffer (of 20% or so) to Y might help. Also, if one of the categories sees a sudden increase in the number of elements, the partitions will need rebalancing.
Are there any existing algorithms that might help here?
This is NP-hard (knapsack) on top of NP-hard (finding the optimal way to split too-large categories) on top of the currently unknowable (future data changes). Obviously the best that you can do is a heuristic.
Sort the categories by descending size. Using a heap/priority queue for the partitions, put each category into the least full available partition. If the category won't fit, split it as evenly as you can across the smallest possible number of partitions. My guess (experiment!) is that trying to keep partitions at similar fill levels is best.
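One way the initial assignment might look in Python (a sketch under the assumptions above; assign, categories and capacity are illustrative names, and the splitting policy is simplified):

import heapq
import math

def assign(categories, num_partitions, capacity):
    """Greedy initial assignment: largest categories first, each placed into the
    least full partition(s). categories: dict category_id -> list of element_ids.
    Returns dict element_id -> partition index."""
    heap = [(0, p) for p in range(num_partitions)]   # (current fill, partition)
    heapq.heapify(heap)
    assignment = {}
    for cat, elements in sorted(categories.items(), key=lambda kv: -len(kv[1])):
        remaining = list(elements)
        # Split an oversized category as evenly as possible over the fewest partitions.
        n_chunks = max(1, math.ceil(len(remaining) / capacity))
        chunk = math.ceil(len(remaining) / n_chunks) if remaining else 0
        while remaining:
            fill, p = heapq.heappop(heap)            # least full available partition
            take = min(chunk, capacity - fill, len(remaining))
            if take == 0:
                raise ValueError("total capacity exhausted")
            for e in remaining[:take]:
                assignment[e] = p
            remaining = remaining[take:]
            heapq.heappush(heap, (fill + take, p))
    return assignment

Reassignment with sticky elements and preferred partitions would need the extra grouping and sorting steps described next.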
On reassignment, delete the deleted elements first. Then group new elements by category. Sort the categories by how many preferred locations they have ascending, and then by descending size. Now move the categories with 0 preferred locations to the end.
For each category, if possible split its new elements across the preferred partitions, leaving them equally full. If this is not possible, put them into the emptiest possible partition. If that is not possible, then split them to put them across the fewest possible partitions.
It is, of course, possible to come up with data sets that eventually turn this into a mess. But it makes a pretty good good-faith effort to come out well.
I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just containing a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects are present in most of the sets, and I want to exploit this.
My idea is to represent sets as stacks (i.e. singly linked lists) whose bottom parts can be shared across different sets. Equivalently, it can be seen as a tree where each node/leaf has a pointer to its parent, but not to its children.
Such a data structure allows the most common subsets of objects to be used as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm; I'll write it as recursive pseudo-code.
BuildAllChains()
{
    BuildSubChains(allSets, NULL);
}

BuildSubChains(sets, pParent)
{
    if (sets is empty)
        return;

    trgObj = the most frequent object from sets;

    pNode = new Node;
    pNode->Object = trgObj;
    pNode->pParent = pParent;

    newSets = empty;
    for (each set in sets that contains the trgObj)
    {
        remove trgObj from set;
        remove set from sets;
        if (set is empty)
            set->pHead = pNode;     // this set's chain is complete; remember its head
        else
            newSets.Insert(set);
    }

    BuildSubChains(sets, pParent);  // the sets that did not contain trgObj
    BuildSubChains(newSets, pNode); // the sets that did, chained under pNode
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
In practice I use a combination of loop + recursion, where the recursion is always invoked on the smaller part.
So the idea is: each time, select the most common object and create a "subset" node that inherits its parent subset; all the sets that include this object (as well as all the predecessors selected so far) should then be based on this subset.
Now I'm trying to figure out an efficient way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects and sort it once. Then, during the recursion, whenever we remove an object and select only the sets that contain/don't contain it, deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select the most frequent object directly each time, i.e. O(N*M). But this also looks inferior: in a degenerate case, where an object exists in either almost all or almost no sets, we may need to repeat this O(N) times. OTOH, for those specific cases an in-place adjustment of the sorted histogram may be the preferred way to go.
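For reference, a straightforward Python sketch of the chain-building with this direct selection (Node and build_chains are illustrative names; naive recursion is used here just to keep the sketch short, despite the depth concern mentioned above):

from collections import Counter

class Node:
    def __init__(self, obj, parent):
        self.obj = obj
        self.parent = parent          # pointer towards the root; no child pointers

def build_chains(sets):
    """sets: dict set_id -> set of objects. Returns dict set_id -> head Node."""
    heads = {}
    def build(family, parent):
        if not family:
            return
        # Direct selection of the most frequent object: O(N*M) per level.
        counts = Counter(obj for s in family.values() for obj in s)
        trg = counts.most_common(1)[0][0]
        node = Node(trg, parent)
        with_trg, without_trg = {}, {}
        for sid, s in family.items():
            if trg in s:
                rest = s - {trg}
                if rest:
                    with_trg[sid] = rest
                else:
                    heads[sid] = node     # this set's chain is complete
            else:
                without_trg[sid] = s
        build(without_trg, parent)        # sets that did not contain trg
        build(with_trg, node)             # sets that did, chained under node
    build(dict(sets), None)
    return heads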
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
@Ivan: first, thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than only the count. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, cross-linked pointers, etc. I planned this from the beginning, because back then it seemed to me that the histogram adjustment after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven, creating a new histogram for the smaller part would be too expensive. But restricting it to only the existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was an immediate fix of the histogram after every single element removal: i.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort order is broken, swap the histogram element with its neighbor until the order is restored.
This seems good if we remove a small number of objects: we don't need to traverse the whole histogram, we just do a "micro-bubble" sort.
However, when removing a large number of objects it seems better to just remove all of them and then re-sort the array via quicksort.
So, do you have a better idea regarding this?
Update2:
I'm thinking about the following: the histogram should be a binary search tree (auto-balanced, of course), where each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criterion is the size of this list.
Each set should contain the list of objects it currently contains, where each "object" holds a direct pointer to its histogram element. In addition, each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in two linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node should also contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), where M is the number of elements currently affected, which is smaller than N if only a small number of them are affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For clarity, I refer to the initial sets as sets and to the set of these sets as the family.
Indeed, let's store a histogram. In the BuildSubChains function we maintain a histogram of all elements which are present in the sets at the moment, sorted by frequency. It may be something like an std::set of pairs (frequency, value), maybe with cross-references so you can find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let their total sizes be T' and T''. Take the family with the smaller total size and remove all elements of its sets from the histogram, building the new histogram as you go. Now you have a histogram for both families, and it is built in time O(min(T', T'') log n), where log n comes from operations on std::set.
At first glance it seems that this works in quadratic time. However, it is faster. Take a look at any single element: every time we explicitly remove this element from the histogram, the size of its family at least halves, so each element directly participates in no more than log T removals. So there will be O(T log T) operations on histograms in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain each element (simply another std::set per element), you'll be able to efficiently select all sets that contain the most frequent element.
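A small Python sketch of just the split-and-rebuild step (a collections.Counter stands in for the sorted histogram; a real implementation would keep it in an ordered structure such as std::set so the maximum stays cheap to find, and the removal of the chosen element itself is omitted here):

from collections import Counter

def split_family(family, full_hist, trg):
    """family: dict set_id -> set of objects; full_hist: Counter over all of them.
    Splits around trg and rebuilds the histogram only for the smaller side."""
    with_trg = {sid: s for sid, s in family.items() if trg in s}
    without_trg = {sid: s for sid, s in family.items() if trg not in s}
    small, big = sorted((with_trg, without_trg),
                        key=lambda f: sum(map(len, f.values())))
    # Rebuild the smaller side's histogram from scratch...
    small_hist = Counter(obj for s in small.values() for obj in s)
    # ...and subtract it from the full one; Counter subtraction drops zero counts,
    # so neither histogram keeps entries for elements no longer present.
    big_hist = full_hist - small_hist
    return (small, small_hist), (big, big_hist)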
I have a list with n elements (let's say 10, for example), and I want to distribute these elements into two lists, balanced against each other by some criterion based on the value of each element. I.e. the output should be two lists of 5 elements each that are approximately balanced with each other.
Thanks for your time.
You could employ a greedy strategy (I'm not certain this will give you "optimal" results, but it should give you relatively good results at least).
Start by finding the total value of all the elements in your list, V. The goal is to create two lists each with value about half this amount (ideally as close to 1/2*V as possible). Start with 3 lists, List_original, List_1, List_2.
Pull off items from List_original (starting with the largest, working your way down to the smallest) and put them into List_1 if and only if adding them to List_1 doesn't cause the total value of List_1 to exceed 1/2*V. Everything else goes into List_2.
The result will be that List_1 will be at most 1/2*V and List_2 will be at least 1/2*V. In the event that some subset of your items sums up to exactly 1/2*V then you might get equality. I haven't tried to prove/disprove this yet. Depending on how close to balanced your result has to be, this could be good enough (it should at least be very fast).
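A quick Python sketch of this greedy strategy (split_balanced is a made-up name; as described, it balances total value rather than element counts):

def split_balanced(values):
    """Greedy: largest values first; put each into list_1 only while its total stays <= V/2."""
    half = sum(values) / 2.0               # the per-list target, 1/2*V
    list_1, list_2, sum_1 = [], [], 0
    for v in sorted(values, reverse=True): # largest to smallest
        if sum_1 + v <= half:
            list_1.append(v)
            sum_1 += v
        else:
            list_2.append(v)
    return list_1, list_2

# e.g. split_balanced([8, 7, 6, 5, 4, 3, 2, 1]) -> ([8, 7, 3], [6, 5, 4, 2, 1])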
I came up with a quick "solution" by taking the average value of the full list, then ordering it ascending, taking the two highest values, one for each list, and then iterating over the rest. On each iteration I compared the average of the full list with the average each of the two sublists would have after adding the current element, and put the element into the list whose average was closer to the full list's average. Keep doing this until the lists are full.
I know it is not the best choice but it was good enough for now.
Hope my explanation was clear enough.
Thanks to all.
I know about creating hashcodes, collisions, the relationship between .GetHashCode and .Equals, etc.
What I don't quite understand is how a 32-bit hash number is used to get the ~O(1) lookup. If you had an array big enough to hold all the possibilities of a 32-bit number then you would get ~O(1), but that would be a waste of memory.
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup. When the number of elements reaches a certain threshold (say 75%) it would expand the array to something like 10K items and recompute the internal hash numbers to 4 digit numbers, based on the 32bit hash of course.
btw, here I'm using ~O(1) to account for possible collisions and their resolutions.
Do I have the gist of it correct or am I completely off the mark?
My guess is that internally the Hashtable class creates a small array (e.g. 1K items) and then rehash the 32bit number to a 3 digit number and use that as lookup.
That's exactly what happens, except that the capacity (number of bins) of the table is more commonly set to a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to insert an item. When the capacity is a power of two, the modulus operation becomes a simple bitmasking op.
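In code, the bin lookup boils down to something like this (Python sketch; capacity is the current number of bins):

def bin_index(hash_code, capacity):
    """Map a (32-bit) hash code to a bin: plain modulo."""
    return hash_code % capacity

def bin_index_pow2(hash_code, capacity):
    """When capacity is a power of two, the modulo becomes a bit mask."""
    return hash_code & (capacity - 1)

# e.g. bin_index(0x9E3779B9, 1024) == bin_index_pow2(0x9E3779B9, 1024)  # True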
When the number of elements reaches a certain threshold (say 75%)
If you're referring to the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.
it would expand the array to something like 10K items
In most implementations, the capacity will not be increased ten-fold but rather doubled (for power-of-two-sized hash tables) or multiplied by roughly 1.5 + the distance to the next prime number.
The hashtable has a number of bins that contain items. The number of bins are quite small to start with. Given a hashcode, it simply uses hashcode modulo bincount to find the bin in which the item should reside. That gives the fast lookup (Find the bin for an item: Take modulo of the hashcode, done).
Or in (pseudo) code:
int hash = obj.GetHashCode();
int binIndex = hash % binCount;
// The item is in bin #binIndex. Go get the items there and find the one that matches.
Obviously, as you figured out yourself, at some point the table will need to grow. When it does, a new array of bins is created, and the items in the table are redistributed to the new bins. This also means that growing a hashtable can be slow. (So, approx. O(1) in most cases, unless the insert triggers an internal resize. Lookups should always be ~O(1).)
In general, there are a number of variations in how hash tables handle overflow.
Many (including Java's, if memory serves) resize when the load factor (percentage of bins in use) exceeds some particular percentage. The downside of this is that the speed is undependable -- most insertions will be O(1), but a few will be O(N).
To ameliorate that problem, some resize gradually instead: when the load factor exceeds the magic number, they:
Create a second (larger) hash table.
Insert the new item into the new hash table.
Move some items from the existing hash table to the new one.
Then, each subsequent insertion moves another chunk from the old hash table to the new one. This retains the O(1) average complexity, and can be written so the cost of every insertion is essentially constant: when the hash table gets "full" (i.e. the load factor exceeds your trigger point) you double the size of the table. Then, on each insertion you insert the new item and move one item from the old table to the new one. The old table will empty exactly as the new one fills up, so every insertion involves exactly two operations: inserting one new item and moving one old one, so insertion speed remains essentially constant.
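A toy Python sketch of that incremental scheme (separate chaining, a fixed 75% trigger, one item migrated per insert; all names are made up and duplicate keys aren't handled):

class IncrementalHashTable:
    """Toy hash table that migrates one old item per insert after growing."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.old_buckets = None      # previous bin array, still being drained
        self.drain_idx = 0           # next old bin to drain
        self.size = 0

    def _bucket(self, buckets, key):
        return buckets[hash(key) % len(buckets)]

    def put(self, key, value):
        if self.old_buckets is None and self.size >= 0.75 * len(self.buckets):
            # Trigger point reached: allocate a table twice the size, migrate lazily.
            self.old_buckets = self.buckets
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            self.drain_idx = 0
        self._bucket(self.buckets, key).append((key, value))
        self.size += 1
        self._migrate_one()

    def _migrate_one(self):
        # Move at most one item from the old table per insertion.
        while self.old_buckets is not None:
            if self.drain_idx == len(self.old_buckets):
                self.old_buckets = None          # old table fully drained
            elif self.old_buckets[self.drain_idx]:
                key, value = self.old_buckets[self.drain_idx].pop()
                self._bucket(self.buckets, key).append((key, value))
                return
            else:
                self.drain_idx += 1              # skip an already-empty old bin

    def get(self, key):
        for buckets in (self.buckets, self.old_buckets):
            if buckets is not None:
                for k, v in self._bucket(buckets, key):
                    if k == key:
                        return v
        raise KeyError(key)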
There are also other strategies. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely. As the hash table fills up, you just end up with more items in each tree. In theory, this means the complexity is O(log N), but for any practical size it's proportional to log N/M, where M is the number of buckets. For practical size ranges (e.g. up to several billion items) that's essentially constant (log N grows very slowly), and it's often a little faster for the largest table you can fit in memory, and a lot faster for smaller sizes. The shortcoming is that it's only really practical when the objects you're storing are fairly large -- if you stored (for example) one character per node, the overhead from two pointers (plus, usually, balance information) per node would be extremely high.
I'm designing a piece of a game where the AI needs to determine which combination of armor will give the best overall stat bonus to the character. Each character will have about 10 stats, of which only 3-4 are important, and of those important ones, a few will be more important than the others.
Armor will also give a boost to one or more stats. For example, a shirt might give +4 to the character's Int and +2 Stamina while, at the same time, a pair of pants may have +7 Strength and nothing else.
So let's say that a character has a healthy choice of armor to use (5 pairs of pants, 5 pairs of gloves, etc.) We've designated that Int and Perception are the most important stats for this character. How could I write an algorithm that would determine which combination of armor and items would result in the highest of any given stat (say in this example Int and Perception)?
Targeting one statistic
This is pretty straightforward. First, a few assumptions:
You didn't mention this, but presumably one can only wear at most one kind of armor for a particular slot. That is, you can't wear two pairs of pants, or two shirts.
Presumably, also, the choice of one piece of gear does not affect or conflict with others (other than the constraint of not having more than one piece of clothing in the same slot). That is, if you wear pants, this in no way precludes you from wearing a shirt. But notice, more subtly, that we're assuming you don't get some sort of synergy effect from wearing two related items.
Suppose that you want to target statistic X. Then the algorithm is as follows:
Group all the items by slot.
Within each group, sort the potential items in that group by how much they boost X, in descending order.
Pick the first item in each group and wear it.
The set of items chosen is the optimal loadout.
Proof: the only way to get a higher X stat would be if there were an item A which provided more X than the item chosen from its group. But we already sorted the items in each group in descending order, so there can be no such A.
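A Python sketch of these three steps (the item shape, a "slot" plus a "stats" dict, is assumed for illustration; best_loadout is a made-up name):

from collections import defaultdict

def best_loadout(items, score):
    """items: list of dicts with a "slot" and a "stats" mapping.
    score: function item -> number, e.g. the boost to the targeted statistic."""
    by_slot = defaultdict(list)
    for item in items:                      # 1. group the items by slot
        by_slot[item["slot"]].append(item)
    loadout = []
    for group in by_slot.values():
        group.sort(key=score, reverse=True) # 2. sort each group by the boost, descending
        loadout.append(group[0])            # 3. wear the first item in each group
    return loadout

# Targeting a single statistic, say Int:
# best_loadout(items, score=lambda it: it["stats"].get("int", 0))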
What happens if the assumptions are violated?
If assumption one isn't true -- that is, you can wear multiple items in each slot -- then instead of picking the first item from each group, pick the first Q(s) items from each group, where Q(s) is the number of items that can go in slot s.
If assumption two isn't true -- that is, items do affect each other -- then we don't have enough information to solve the problem. We'd need to know specifically how items can affect each other, or else be forced to try every possible combination of items through brute force and see which ones have the best overall results.
Targeting N statistics
If you want to target multiple stats at once, you need a way to tell "how good" something is. This is called a fitness function. You'll need to decide how important the N statistics are, relative to each other. For example, you might decide that every +1 to Perception is worth 10 points, while every +1 to Intelligence is only worth 6 points. You now have a way to evaluate the "goodness" of items relative to each other.
Once you have that, instead of optimizing for X, you instead optimize for F, the fitness function. The process is then the same as the above for one statistic.
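For example, reusing the picker sketched above with a weighted fitness function (the weights are the illustrative values mentioned earlier):

# Designer-chosen weights (example values from the text above).
weights = {"perception": 10, "int": 6}

def fitness(item):
    # F = weighted sum of the item's boosts to the statistics we care about.
    return sum(w * item["stats"].get(stat, 0) for stat, w in weights.items())

# Same picker as in the single-statistic sketch, now maximising F per slot:
# best_loadout(items, score=fitness)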
If there is no restriction on the number of items per category, the following will work for multiple statistics and multiple items.
Data preparation:
Give each statistic (Int, Perception) a weight, according to how important you determine it is
Store this as a 1-D array statImportance
Give each item-statistic combination a value, according to how much said item boosts said statistic for the player
Store this as a 2-D array itemStatBoost
Algorithm:
In pseudocode. Here assume that itemScore is a sortable Map with Item as the key and a numeric value as the value, and values are initialised to 0.
Assume that the sort method is able to sort this Map by values (not keys).
//Score each item and rank them
for each statistic as S
    for each item as I
        score = itemScore.get(I) + (statImportance[S] * itemStatBoost[I,S])
        itemScore.put(I, score)
sort(itemScore)

//Decide which items to use
maxEquippableItems = 10 //use the appropriate value
selectedItems = new array[maxEquippableItems]
for 0 <= idx < maxEquippableItems
    selectedItems[idx] = itemScore.getByIndex(idx)
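A runnable Python version of the same scoring idea (plain dicts stand in for statImportance, itemStatBoost and the sortable Map; per-slot limits are not handled, per the stated assumption):

def pick_items(stat_importance, item_stat_boost, max_equippable_items=10):
    """stat_importance: dict stat -> weight.
    item_stat_boost: dict item -> dict stat -> boost for that item.
    Returns the max_equippable_items highest-scoring items."""
    item_score = {}
    for item, boosts in item_stat_boost.items():
        # Score each item: sum of importance * boost over all statistics.
        item_score[item] = sum(stat_importance.get(stat, 0) * boost
                               for stat, boost in boosts.items())
    ranked = sorted(item_score, key=item_score.get, reverse=True)
    return ranked[:max_equippable_items]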