I have a simple google sheet where each row represents a node in a tree that holds a reference to its parent and some descriptor values about it. I would like to have a column that sums the child nodes beneath the current one.
e.g.:
Node ID, Parent Node ID, Minimum Value, Self Value, Total Value
1, 0, 30, 10, 90
2, 1, 10, 20, 40
3, 1, 10, 20, 40
4, 2, 1, 10, 10
5, 3, 1, 10, 10
6, 3, 1, 10, 10
7, 2, 1, 10, 10
Where Self Value is statically defined, and Total Value represents Self Value + SUM(CHILDREN.Total Value). Do I need to re-organize the sheet to accomplish this or am I missing the proper way to recursively sum-up the child rows?
Introduction
tl;dr: this method works but is impractical for large, complex datasets.
This seems rather complicated for what Sheets or similar software is designed for; I think it would probably be much easier to solve in Apps Script (where I have no experience) or any other scripting language outside of Sheets.
Meanwhile I have come up with a solution that works using only formulas in Sheets. It has some limitations, however: the two formulas have to be manually extended (details below), and it would be cumbersome to use for datasets of very large depth (by depth I mean the maximum number of generations of child nodes).
I have reorganized the columns of your example dataset to make this easier to understand and added two more rows to test it better. I have removed the Minimum Value column since, per your question, it is not relevant to the expected results.
SelfValue, NodeID, Parent
10, 1, 0
20, 2, 1
20, 3, 1
10, 4, 2
10, 5, 3
10, 6, 3
10, 7, 2
5, 8, 7
5, 9, 8
Solution and explanation
My main idea was that it is relatively easy to calculate Total Value of a given node if we know its children in all generations (not just its immediate children, but also "grandchildren" and so on) and their Self Value.
In particular, to know Total Value of a node, we do not need to have explicitly calculated the Total Value of its immediate children.
I have not found a simple way to enumerate children from all generations for a given node. Instead, I approached it from the other direction: for every node, find its parent, its parent's parent, and so on. To do this, I entered the following formula in D2 and then manually extended it across the next columns up to column H (the first column to contain only empty values):
=ARRAYFORMULA(IFERROR(VLOOKUP(C2:C,$B$2:$C,2,false)))
I attempted to make it automatically fill multiple columns without manual extension, but this gave me a circular dependency error.
The next and final step is to calculate the Total Value of all nodes, now that we have a way to identify all of their children (in all generations). I entered the following formula in cell I2 and then manually extended it down across all rows:
=IFERROR(SUM(FILTER(A$2:A,BYROW(B$2:H,LAMBDA(row,NOT(ISERROR(MATCH(B2,row,0))))))))
This calculates the Total Value by adding up the Self Value of all nodes for which the given node is a parent (in any generation), plus the Self Value of the given node itself.
The range B$2:H has to be adapted if the dataset is deeper and more columns are filled by the first formula.
Here is the final result. I have colorized the cells where the two formulas are entered (green, yellow) and extended (light green, light yellow).
It seems it would be more efficient (fewer calculations in the background, more responsive sheet) to use QUERY, but then all the columns C-H would need to be explicitly listed, like select sum(A) where B="&B2&" or C="&B2&" or ..., so it becomes a problem in itself to construct this formula and adapt it to a variable number of columns from the previous step.
I attempted to make the formula automatically fill all rows (instead of manually expanding) by experimenting with ARRAYFORMULA or MAP(LAMBDA), but it either didn't work or exceeded the calculation limit.
Anyway, it would be interesting to see if there is a simpler solution using only formulas. It surely can also be done more efficiently and elegantly using Apps Script.
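As an illustration of that recursion, here is a minimal sketch in Java (not Apps Script, but the same logic would transfer directly; the class and field names are my own): build a map from each parent to its children, and then the Total Value of a node is its Self Value plus the Total Value of each immediate child.

import java.util.*;

public class TreeTotals {
    // Hypothetical row model: each node knows its parent and its Self Value.
    record Node(int id, int parentId, int selfValue) {}

    // Total Value = Self Value + sum of the Total Value of all immediate children.
    static long totalValue(int id, Map<Integer, List<Node>> childrenByParent,
                           Map<Integer, Node> byId) {
        long total = byId.get(id).selfValue();
        for (Node child : childrenByParent.getOrDefault(id, List.of())) {
            total += totalValue(child.id(), childrenByParent, byId);
        }
        return total;
    }

    public static void main(String[] args) {
        // The extended example dataset from above: (id, parent, self value).
        List<Node> rows = List.of(
            new Node(1, 0, 10), new Node(2, 1, 20), new Node(3, 1, 20),
            new Node(4, 2, 10), new Node(5, 3, 10), new Node(6, 3, 10),
            new Node(7, 2, 10), new Node(8, 7, 5),  new Node(9, 8, 5));

        Map<Integer, Node> byId = new HashMap<>();
        Map<Integer, List<Node>> childrenByParent = new HashMap<>();
        for (Node n : rows) {
            byId.put(n.id(), n);
            childrenByParent.computeIfAbsent(n.parentId(), k -> new ArrayList<>()).add(n);
        }
        for (Node n : rows) {
            System.out.println("Node " + n.id() + " -> Total Value " +
                totalValue(n.id(), childrenByParent, byId));
        }
    }
}

For the extended dataset above this prints, for example, "Node 1 -> Total Value 100".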
Related
Mark has a collection of N postage stamps. Each stamp belongs to some type; types are enumerated as positive integers, and more valuable stamps have a higher enumerated type.
On any particular day, E-bay lists several offers, each of which is represented as an unordered pair {A, B}, allowing its users to exchange stamps of type A with an equal number of stamps of type B. Mark can use such an offer to put up any number of stamps of type A on the website and get the same number of stamps of type B in return, or vice versa. Assume that any number of stamps Mark wants are always available on the site's exchange market. Each offer is open during only one day: Mark can't use it after this day, but he can use it several times during this day. If there are some offers which are active during a given day, Mark can use them in any order.
Find the maximum possible value of his collection after going through (accepting or declining) all the offers. The value of Mark's collection is equal to the sum of the type enumerations of all stamps in the collection.
How does dynamic programming lead to the solution for this problem? (Mark knows what offers will come in the future.)
I would maintain a table that gives, for each type, the maximum value that you can get for a member of that type using only the last N swaps.
To compute this for N=0 just put down the value of each type without swaps.
To compute this for N=i+1, look at the (i+1)-th swap from the end and the table for N=i. That swap names two types, whose table entries probably have different values. Because you can use that swap, you can alter the table to set the lower of the two values equal to the higher one.
When you have a table taking into account all the swaps you can sum up the values for the types that Mark is starting with to get the answer.
Example tables for the swaps {4, 5}, {5, 3}, {3, 1}, {1, 20}:
1  2  3  4  5 .. 20
20 2  3  4  5 .. 20
20 2 20  4  5 .. 20
20 2 20  4 20 .. 20
20 2 20 20 20 .. 20
Example for swaps {1, 5} and then {1, 20}
1 2 3 4 5 .. 20
20 2 3 4 5 .. 20
20 2 3 4 20 .. 20
Note that i=1 means taking account of the last swap possible, so we are working backwards as far as swaps are concerned. The final table reflects the fact that 5 can be swapped for 1 before 1 is swapped for 20. You can work out a schedule of which swaps to do when by looking at which swap is available at time i and which table entries change at that time.
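Here is a minimal sketch of that backward table-filling in Java (the starting collection in main is hypothetical; everything else follows the description above):

import java.util.*;

public class StampSwaps {
    // best[t] = the maximum value obtainable for one stamp of type t, taking into
    // account only the last N offers. Fill it by walking the offers backwards.
    static long[] bestValuePerType(int maxType, int[][] offersInOrder) {
        long[] best = new long[maxType + 1];
        for (int t = 1; t <= maxType; t++) best[t] = t;    // N = 0: no swaps at all

        for (int i = offersInOrder.length - 1; i >= 0; i--) {
            int a = offersInOrder[i][0], b = offersInOrder[i][1];
            long better = Math.max(best[a], best[b]);      // the swap lets us take the better side
            best[a] = better;
            best[b] = better;
        }
        return best;
    }

    public static void main(String[] args) {
        // Offers in chronological order: {4,5}, {5,3}, {3,1}, {1,20}
        int[][] offers = {{4, 5}, {5, 3}, {3, 1}, {1, 20}};
        long[] best = bestValuePerType(20, offers);

        // Mark's starting collection (hypothetical): one stamp each of types 1, 4 and 2.
        int[] collection = {1, 4, 2};
        long total = 0;
        for (int t : collection) total += best[t];
        System.out.println("Maximum collection value: " + total);   // 20 + 20 + 2 = 42
    }
}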
Dynamic programming means breaking a problem down into smaller subproblems. Your problem is well defined as a value-ordered collection of stamps of different types, so Value(T1) < Value(T2) < ... < Value(Tn).
Finding the maximum value of the collection will be determined by the opportunities to swap pairs of types. Of course, we only want to swap pairs when it will increase the total value of the collection.
Therefore, we define a simple swap operation: we swap if the collection contains stamps of the lower-valued type in the swap opportunity.
If sufficient opportunities of differing types are offered, then the collection could ultimately contain all stamps at the highest value.
My suggestion is to create a collection data structure, a simple conditioned swap function and perhaps an event queue which responds to swap events.
Dynamic Table
Take a look at this diagram, which shows how I would set up my data. The key is to start from the last row and work backwards computing the best deals, then move forward and take the best deal going forward.
How do I form a combination of, say, 10 questions so that each student (total students = 10) gets a unique combination?
I don't want to use factorial.
You can use a circular queue data structure: arrange the numbers 1-10 in it.
Now you can cut the queue at any point you like, and it will give you a unique sequence.
For example, if you cut it at the point between 2 and 3 and then iterate over the queue, you will get:
3, 4, 5, 6, 7, 8, 9, 10, 1, 2
So you need to implement a circular queue, then cut it at 10 different points (after 1, after 2, after 3, ...).
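A minimal sketch of that idea in Java (no real circular-queue class, just modular arithmetic, which amounts to the same cutting operation):

import java.util.*;

public class RotatedQuestionOrders {
    public static void main(String[] args) {
        int n = 10;                      // 10 questions, 10 students
        List<List<Integer>> orders = new ArrayList<>();

        // "Cutting" the circular queue after position s is the same as
        // rotating the base order 1..n by s positions.
        for (int s = 0; s < n; s++) {
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                order.add(((s + i) % n) + 1);
            }
            orders.add(order);
        }

        for (int student = 0; student < n; student++) {
            System.out.println("Student " + (student + 1) + ": " + orders.get(student));
        }
    }
}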
There are 3,628,800 different permutations of 10 items taken 10 at a time.
If you only need 10 of them you could start with an array that has the values 1-10 in it. Then shuffle the array. That becomes your first permutation. Shuffle the array again and check to see that you haven't already generated that permutation. Repeat that process: shuffle, check, save, until you have 10 unique permutations.
It's highly unlikely (although possible) that you'll generate a duplicate permutation in only 10 tries.
The likelihood that you generate a duplicate increases as you generate more permutations, increasing to 50% by the time you've generated about 2,000. But if you just want a few hundred or less, then this method will do it for you pretty quickly.
The proposed circular queue technique works, too, and has the benefit of simplicity, but the resulting sequences are simply rotations of the original order, and it can't produce more than 10 without a shuffle. The technique I suggest will produce more "random" looking orderings.
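For completeness, a minimal sketch of the shuffle-check-save loop in Java, using Collections.shuffle and a set to reject duplicates:

import java.util.*;

public class ShuffledOrders {
    public static void main(String[] args) {
        int n = 10, wanted = 10;
        Set<List<Integer>> unique = new LinkedHashSet<>();
        List<Integer> base = new ArrayList<>();
        for (int i = 1; i <= n; i++) base.add(i);

        // Shuffle, check, save - repeat until we have enough distinct permutations.
        while (unique.size() < wanted) {
            Collections.shuffle(base);
            unique.add(new ArrayList<>(base));   // the set silently rejects duplicates
        }
        unique.forEach(System.out::println);
    }
}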
I have been sitting on this for almost a week now. Here is the question in a PDF format.
I could only think of one idea so far but it failed. The idea was to recursively create all connected subgraphs which works in O(num_of_connected_subgraphs), but that is way too slow.
I would really appreciate someone giving me a direction. I'm inclined to think that the only way is dynamic programming, but I can't seem to figure out how to do it.
OK, here is a conceptual description for the algorithm that I came up with:
Form an array of the (x,y) board map from -7 to 7 in both dimensions and place the opponent's pieces on it.
Starting with the first row (lowest Y value, -N):
enumerate all possible combinations of the 2nd player's pieces on the row, eliminating only those that conflict with the opponent's pieces.
for each combination on this row:
--group connected pieces into separate networks and number these
networks starting with 1, ascending
--encode the row as a vector using:
= 0 for any unoccupied or opponent position
= (1-8) for the network group that that piece/position is in.
--give each such grouping a COUNT of 1, and add it to a dictionary/hashset using the encoded vector as its key
Now, for each succeeding row, in ascending order {y=y+1}:
For every entry in the previous row's dictionary:
--If the entry has exactly 1 group, add its COUNT to TOTAL
--enumerate all possible combinations of the 2nd player's pieces
on the current row, eliminating only those that conflict with the
opponent's pieces. (Change:) you should skip the initial combination
(where all entries are zero) for this step, as the step above actually
covers it. For each such combination on the current row:
+ produce a grouping vector as described above
+ compare the current row's group-vector to the previous row's
group-vector from the dictionary:
++ if there are any group-*numbers* from the previous row's
vector that are not adjacent to any groups in the current
row's vector, *for at least one value of X*, then skip
to the next combination.
++ any groups for the current row that are adjacent to any
groups of the previous row, acquire the lowest such group
number
++ any groups for the current row that are not adjacent to
any groups of the previous row, are assigned an unused
group number
+ Re-Normalize the group-number assignments for the current-row's
combination (**) and encode the vector, giving it a COUNT equal
to the previous row-vector's COUNT
+ Add the current-row's vector to the dictionary for the current
Row, using its encoded vector as the key. If it already exists,
then add its COUNT to the COUNT for the pre-existing entry
Finally, for every entry in the dictionary for the last row:
If the entry has exactly one group, then add its COUNT to TOTAL
**: Re-normalizing simply means re-assigning the group numbers so as to eliminate any permutations in the grouping pattern. Specifically, new group numbers should be assigned in increasing order, from left to right, starting from one. So for example, if your grouping vector looked like this after matching it against the previous row:
2 0 5 5 0 3 0 5 0 7 ...
it should be re-mapped to this normal form:
1 0 2 2 0 3 0 2 0 4 ...
Note that as in this example, after the first row, the groupings can be discontiguous. This relationship must be preserved, so the two groups of "5"s are re-mapped to the same number ("2") in the re-normalization.
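Here is a minimal sketch of that re-normalization step in Java (the method name is mine; 0 entries are left untouched, and discontiguous occurrences of the same group keep the same new number):

import java.util.*;

public class GroupVectorNormalizer {
    // Re-assigns group numbers left-to-right starting from 1, so that two grouping
    // vectors that differ only by a permutation of group numbers get the same normal form.
    static int[] renormalize(int[] groups) {
        Map<Integer, Integer> remap = new HashMap<>();
        int next = 1;
        int[] out = new int[groups.length];
        for (int i = 0; i < groups.length; i++) {
            if (groups[i] == 0) continue;            // empty/opponent positions stay 0
            Integer g = remap.get(groups[i]);
            if (g == null) {
                g = next++;
                remap.put(groups[i], g);
            }
            out[i] = g;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v = {2, 0, 5, 5, 0, 3, 0, 5, 0, 7};
        System.out.println(Arrays.toString(renormalize(v)));  // [1, 0, 2, 2, 0, 3, 0, 2, 0, 4]
    }
}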
OK, a couple of notes:
A. I think that this approach is correct, but I am really not certain, so it will definitely need some vetting, etc.
B. Although it is long, it's still pretty sketchy. Each individual step is non-trivial in itself.
C. Although there are plenty of individual optimization opportunities, the overall algorithm is still pretty complicated. It is a lot better than brute-force, but even so, my back-of-the-napkin estimate is still around (2.5 to 10)*10^11 operations for N=7.
So it's probably tractable, but still a long way off from doing 74 cases in 3 seconds. I haven't read all of the detail for Peter de Revaz's answer, but his idea of rotating the "diamond" might be workable for my algorithm. Although it would increase the complexity of the inner loop, it may drop the size of the dictionaries (and thus, the number of grouping-vectors to compare against) by as much as a 100x, though it's really hard to tell without actually trying it.
Note also that there isn't any dynamic programming here. I couldn't come up with an easy way to leverage it, so that might still be an avenue for improvement.
OK, I enumerated all possible valid grouping-vectors to get a better estimate of (C) above, which lowered it to O(3.5*10^9) for N=7. That's much better, but still about an order of magnitude over what you probably need to finish 74 tests in 3 seconds. That does depend on the tests, though; if most of them are smaller than N=7, it might be able to make it.
Here is a rough sketch of an approach for this problem.
First note that the lattice points need |x|+|y| < N, which results in a diamond shape going from coordinates 0,6 to 6,0 i.e. with 7 points on each side.
If you imagine rotating this diamond by 45 degrees, you will end up with a 7*7 square lattice which may be easier to think about. (Although note that there are also intermediate 6 high columns.)
For example, for N=3 the original lattice points are:
..A..
.BCD.
EFGHI
.JKL.
..M..
Which rotate to
A D I
C H
B G L
F K
E J M
On the (possibly rotated) lattice I would attempt to solve by dynamic programming the problem of counting the number of ways of placing armies in the first x columns such that the last column is a certain string (plus a boolean flag to say whether some points have been placed yet).
The string contains a digit for each lattice point.
0 represents an empty location
1 represents an isolated point
2 represents the first of a new connected group
3 represents an intermediate in a connected group
4 represents the last in a connected group
During the algorithm the strings can represent shapes containing multiple connected groups, but we reject any transformations that leave an orphaned connected group.
When you have placed all columns you need to only count strings which have at most one connected group.
For example, the string for the first 5 columns of the shape below is:
....+ = 2
..+++ = 3
..+.. = 0
..+.+ = 1
..+.. = 0
..+++ = 3
..+++ = 4
The middle + is currently unconnected, but may become connected by a later column, so it still needs to be tracked. (In this diagram I am also assuming an up/down/left/right 4-connectivity. The rotated lattice should really use a diagonal connectivity, but I find that a bit harder to visualise and I am not entirely sure it is still a valid approach with this connectivity.)
I appreciate that this answer is not complete (and could do with lots more pictures/explanation), but perhaps it will prompt someone else to provide a more complete solution.
I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers. Where each set could potentially contain around a 1000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly determine which sets the InputSet is a subset of. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of Bloom filters to quickly eliminate sets to which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well, it seems that the bottleneck is the number of sets, so instead of finding a set by iterating over all of them, you could improve performance by mapping from elements to all the sets containing them, and returning the sets that contain all the elements you searched for.
This is very similar to what is done in an AND query when searching an inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
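A minimal sketch of that inverted-index AND query in Java, using the sets from the question (the names and in-memory representation are only illustrative):

import java.util.*;

public class InvertedIndexSubsets {
    public static void main(String[] args) {
        Map<String, int[]> sets = Map.of(
            "Set1", new int[]{1, 3, 7},
            "Set2", new int[]{1, 5, 8, 10},
            "Set3", new int[]{1, 3, 11, 14, 15},
            "Set1000000", new int[]{1, 7, 10, 19});

        // element -> ids of the sets containing it (the inverted index)
        Map<Integer, Set<String>> index = new HashMap<>();
        sets.forEach((name, elems) -> {
            for (int e : elems) index.computeIfAbsent(e, k -> new HashSet<>()).add(name);
        });

        // AND query: intersect the posting lists of every element of the input set.
        int[] input = {1, 7};
        Set<String> result = new HashSet<>(index.getOrDefault(input[0], Set.of()));
        for (int i = 1; i < input.length; i++) {
            result.retainAll(index.getOrDefault(input[i], Set.of()));
        }
        System.out.println(result);   // Set1 and Set1000000 (order unspecified)
    }
}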
EDIT:
In inverted indexes in IR, to save space we sometimes use d-gaps, meaning we store the offset between documents rather than the actual number. For example, [2,5,10] becomes [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find out whether a specific set/document is in it, and you cannot use binary search. But it is sometimes worth it, especially if it makes the difference between fitting the index into RAM or not.)
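A small sketch of the d-gap encoding and decoding in Java, matching the [2,5,10] -> [2,3,5] example:

import java.util.*;

public class DGaps {
    // Encode a sorted posting list as gaps: [2, 5, 10] -> [2, 3, 5]
    static int[] toGaps(int[] sorted) {
        int[] gaps = new int[sorted.length];
        int prev = 0;
        for (int i = 0; i < sorted.length; i++) {
            gaps[i] = sorted[i] - prev;
            prev = sorted[i];
        }
        return gaps;
    }

    // Decode by summing the gaps back up.
    static int[] fromGaps(int[] gaps) {
        int[] out = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] postings = {2, 5, 10};
        System.out.println(Arrays.toString(toGaps(postings)));            // [2, 3, 5]
        System.out.println(Arrays.toString(fromGaps(toGaps(postings))));  // [2, 5, 10]
    }
}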
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [1,2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you should consider intervals (1-5) and (6-10) (which can be determined simply by knowing the size of the interval). Intersecting those candidate lists gives you [1,2,1000000]. A binary search of the sets then shows that (1,7) actually exists in Set1 and Set1000000, but not in Set2.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the subset range should be something like (min + max)/N, where 2N is the maximum number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" is answered by finding the closest values to 5 and 10 in the set (3 and 11 here); in this case, no, it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max). However, the issue is that I suspect your numbers are going to be clustered, so this wouldn't provide much use. Although, as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what range to use: too large and it'll take a long time to build the data structure (1000 * million * log(N)); too small and you'll start to run into space issues. The ideal size of the range is probably one that ensures the number of sets related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to increase the interval size and split the current intervals to ensure that the search is fast. This is especially true if preprocessing time isn't a major issue.
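To make the interval idea concrete, here is a minimal sketch in Java (interval size 5, set contents from the question, everything else illustrative): bucket each set by the intervals its elements fall into, intersect the buckets of the query elements, then confirm the surviving candidates with binary searches.

import java.util.*;

public class IntervalIndex {
    public static void main(String[] args) {
        int intervalSize = 5;
        Map<String, int[]> sets = Map.of(
            "Set1", new int[]{1, 3, 7},
            "Set2", new int[]{1, 5, 8, 10},
            "Set3", new int[]{1, 3, 11, 14, 15},
            "Set1000000", new int[]{1, 7, 10, 19});

        // bucket index -> set ids that have at least one element in that bucket
        Map<Integer, Set<String>> buckets = new HashMap<>();
        sets.forEach((name, elems) -> {
            for (int e : elems) {
                int bucket = (e - 1) / intervalSize;        // 1-5 -> 0, 6-10 -> 1, ...
                buckets.computeIfAbsent(bucket, k -> new HashSet<>()).add(name);
            }
        });

        // Query (1, 7): intersect the candidate lists of the buckets each element falls in,
        // then confirm membership with a binary search on the (sorted) sets themselves.
        int[] input = {1, 7};
        Set<String> candidates = new HashSet<>(
            buckets.getOrDefault((input[0] - 1) / intervalSize, Set.of()));
        for (int i = 1; i < input.length; i++) {
            candidates.retainAll(buckets.getOrDefault((input[i] - 1) / intervalSize, Set.of()));
        }
        candidates.removeIf(name -> {
            for (int e : input) {
                if (Arrays.binarySearch(sets.get(name), e) < 0) return true;  // false positive
            }
            return false;
        });
        System.out.println(candidates);   // Set1 and Set1000000 (order unspecified)
    }
}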
Start searching from the biggest number (7) of the input set and eliminate the other sets (Set1 and Set1000000 will be returned).
Then search for the other input elements (1) in the remaining sets.
I want to pick the top "range" of cards based upon a percentage. I have all my possible 2 card hands organized in an array in order of the strength of the hand, like so:
AA, KK, AKsuited, QQ, AKoff-suit ...
I had been picking the top 10% of hands by multiplying the length of the card array by the percentage which would give me the index of the last card in the array. Then I would just make a copy of the sub-array:
Arrays.copyOfRange(cardArray, 0, 16);
However, I realize now that this is incorrect because there are more possible combinations of, say, Ace-King off-suit (12 combinations, i.e. an ace of one suit and a king of another suit) than there are of, say, a pair of aces (6 combinations).
When I pick the top 10% of hands therefore I want it to be based on the top 10% of hands in proportion to the total number of 2 cards combinations - 52 choose 2 = 1326.
I thought I could have an array of integers where each index held the combined total of all the combinations up to that point (each index would correspond to a hand from the original array). So the first few indices of the array would be:
6, 12, 16, 22
because there are 6 combinations of AA, 6 combinations of KK, 4 combinations of AKsuited, 6 combinations of QQ.
Then I could do a binary search, which runs in O(log n) time. In other words, I could multiply the total number of combinations (1326) by the percentage, search for where this number falls among the cumulative totals, and that would give me the index of the original array that I need.
I wonder if there is a way that I could do this in constant time instead?
As Groo suggested, if precomputation and memory overhead permits, it would be more efficient to create 6 copies of AA, 6 copies of KK, etc and store them into a sorted array. Then you could run your original algorithm on this properly weighted list.
This is best if the number of queries is large.
Otherwise, I don't think you can achieve constant time for each query. This is because the queries depend on the entire frequency distribution; you can't look at only a constant number of elements to determine the correct percentile.
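To illustrate the weighted-list idea, here is a minimal sketch in Java. The hand names and combination counts are just the first few entries of the real ordering, and the percentage is kept small only because the toy list covers just 34 of the 1326 combinations. Once the list is built, picking the top p% is a single index computation, i.e. constant time per query.

import java.util.*;

public class WeightedHandRange {
    public static void main(String[] args) {
        // Illustrative prefix of the strength-ordered hand list and its combo counts.
        String[] hands  = {"AA", "KK", "AKs", "QQ", "AKo"};
        int[]    combos = {  6,    6,     4,    6,    12 };

        // Expand each hand into one entry per combination, keeping the strength order.
        List<String> weighted = new ArrayList<>();
        for (int i = 0; i < hands.length; i++) {
            for (int c = 0; c < combos[i]; c++) weighted.add(hands[i]);
        }

        // Top p% of all 1326 combos = a simple prefix of the weighted list.
        double p = 0.02;                       // 2% here, since the toy list only covers 34 combos
        int cut = (int) Math.round(1326 * p);  // number of combos to take
        Set<String> range = new LinkedHashSet<>(weighted.subList(0, Math.min(cut, weighted.size())));
        System.out.println(range);             // [AA, KK, AKs, QQ, AKo]
    }
}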
We had a similar discussion here: Algorithm for picking thumbed-up items (basically what you want to do with your list of cards). As a comment to my answer, someone suggested a particular data structure: http://en.wikipedia.org/wiki/Fenwick_tree
Also, make sure your data structure will be able to provide efficient access to, say, the range between top 5% and 15% (not a coding-related tip though ;).