I've recently gotten back into learning about discrete math. I enrolled in a course at university and am having trouble getting the hang of things again, especially when it comes to deriving a recurrence relation from a word problem. I would love to have some tips on how to do so.
For example (I've changed numbers from the homework question so if this doesn't work out just let me know): if Jean divides an input of size n into three subsets each of size n/5 and combines them in theta(n) time, what is the runtime? I got 3T(n/5) + theta(n) as the recurrence relation and I have no idea what the runtime is, and I feel like those are both incorrect.
I found this site (https://users.cs.duke.edu/~ola/ap/recurrence.html) to be helpful for breaking down a recurrence relation into a solid runtime, but I still don't get how to get the recurrence relation from the word problem in the first place. Thanks!
Think of such problems as a tree with nodes on each level. Each node holds a number: the size of the subproblem being dealt with at that point. A level can hold anywhere from 1 up to n nodes.
Start at the top level. It has a single node whose value is n (because at the start we have all n elements to deal with).
Now come down to the second level. The question says to divide the problem (the elements) into three parts, so level 2 has 3 nodes. The value in each node is n/5 (because the question says each subset's size is the parent node's element count divided by 5). The tree looks like:
          (n)
        /  |  \
   (n/5) (n/5) (n/5)
Now going one level further down, the tree looks like:
                          (n)                              level 1
            /              |              \
         (n/5)           (n/5)           (n/5)             level 2
        /  |  \         /  |  \         /  |  \
 (n/25)(n/25)(n/25) (n/25)(n/25)(n/25) (n/25)(n/25)(n/25)  level 3
You go on like this until the last level, where each node contains a single element (so that level has n nodes in total).
To write the recurrence, you only need to look at levels 1 and 2. Write T(m) for the time taken to solve a problem with m elements. Then:
(time to solve the level-1 problem) = (time to solve the 1st node of level 2) + (time to solve the 2nd node of level 2) + (time to solve the 3rd node of level 2) + (time to combine the three parts, which the question gives as theta(n))
T(n) = T(n/5) + T(n/5) + T(n/5) + theta(n)
=> T(n) = 3T(n/5) + theta(n)
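A quick way to convince yourself of the answer is to evaluate the recurrence numerically. This is a minimal Python sketch (a hypothetical cost model: the theta(n) combine step is treated as exactly n, with a constant-time base case). By the Master Theorem, log_5(3) ≈ 0.68 < 1, so the combine term dominates and T(n) = Theta(n):

```python
# Minimal sketch: evaluate T(n) = 3*T(n/5) + n, modeling the theta(n)
# combine step as exactly n and small subproblems as constant time.

def T(n):
    if n < 5:           # constant-time base case
        return 1
    return 3 * T(n / 5) + n

# If T(n) is linear, T(n)/n should level off at a constant
# (here 1/(1 - 3/5) = 2.5).
ratios = [T(10 ** e) / 10 ** e for e in (3, 6, 9)]
print(ratios)
```

The ratios creep up toward a constant rather than growing, which is what Theta(n) predicts.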
According to Robert Sedgewick, shellsort (which is supposed to run faster than insertion sort) tries to reduce inversions using a sequence of h-sortings.
In a way, this h-sorting procedure makes the file nearly sorted, and so rearranges the inversions into a more symmetric distribution.
How, then, can one say (as the book does) that insertion sort's running time depends only on the number of inversions and not on how they are distributed?
In insertion sort, every swap that's made reduces the number of inversions by exactly one. Imagine, for example, that we're about to swap two adjacent elements B and A in insertion sort. Right before the swap, the array looks something like this:
+--------------+---+---+------------+
| before | B | A | after |
+--------------+---+---+------------+
And, right afterwards, it looks like this:
+--------------+---+---+------------+
| before | A | B | after |
+--------------+---+---+------------+
Now, think about the inversions in the array. Any inversion purely in "before" or "after" is still there. Every inversion from "before" into "after" is still there, as are inversions from "before" into A, "before" into B, A into "after," and B into "after." The only inversion that's gone is the specific inversion pair (A, B). Consequently, the number of swaps in insertion sort is exactly equal to the number of inversions, since each inversion requires one swap and the algorithm stops when no inversions are left. Notice that it's just the total number of inversions that matters, not where they are.
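This invariant is easy to check empirically. A minimal Python sketch (the helper names are mine, and the quadratic inversion counter is chosen for clarity, not speed):

```python
# Check that insertion sort's swap count equals the number of inversions,
# regardless of where those inversions sit in the array.

from itertools import combinations

def count_inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j]."""
    return sum(1 for i, j in combinations(range(len(a)), 2) if a[i] > a[j])

def insertion_sort_swaps(a):
    """Sort a copy of `a` with adjacent swaps; return the number of swaps."""
    a, swaps = list(a), 0
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            swaps += 1
            j -= 1
    return swaps

for arr in ([3, 1, 2], [5, 4, 3, 2, 1], [2, 1, 4, 3, 6, 5]):
    assert insertion_sort_swaps(arr) == count_inversions(arr)
```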
On the other hand, this is not true about shellsort. Suppose in shellsort that we swap elements B and A, which are out of place but not adjacent. Schematically, right before the swap we have something like this:
+--------------+---+----------+---+------------+
| before | B | middle | A | after |
+--------------+---+----------+---+------------+
And we end with this:
+--------------+---+----------+---+------------+
| before | A | middle | B | after |
+--------------+---+----------+---+------------+
The inversion (B, A) is now gone, but it's also quite possible that even more inversions were eliminated with this step. For example, suppose there are a bunch of elements in "middle" that are less than B. That single swap would then eliminate all of them at the same time.
Because each swap in shellsort can potentially eliminate multiple inversions, the actual locations of those inversions matter for the runtime, not just their total number.
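To see a single gapped swap kill several inversions at once, here is a tiny sketch (the example array is arbitrary: B=9 and A=1 straddle a "middle" of smaller values):

```python
# One shellsort-style gapped swap can remove many inversions at once.

from itertools import combinations

def count_inversions(a):
    return sum(1 for i, j in combinations(range(len(a)), 2) if a[i] > a[j])

arr = [9, 2, 3, 4, 1]            # B=9 at index 0, A=1 at index 4
before = count_inversions(arr)
arr[0], arr[4] = arr[4], arr[0]  # one gapped swap -> [1, 2, 3, 4, 9]
after = count_inversions(arr)
print(before, after)             # 7 inversions removed by a single swap
```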
Not an answer per se: shell sort actually does require fewer comparisons on average than insertion sort and possibly all other sort algorithms PROVIDED you supply it with the correct gap sequence that, in turn, is a function of n (the number of elements to be sorted). There is probably just one (unique) optimal gap sequence for every n.
Defining F(n) is, of course, the tricky part!
I have an algorithm based on Longest Increasing Subsequence that works very well for solving the relevant business problem when the objects in each collection are unique, but tends to give odd results when there are many non-unique objects present in both collections.
It appears that an approach using the Patience Diff algorithm (which is also based on Longest Increasing Subsequence) would provide the results I want when non-unique objects exist. However, before I can figure out if Patience Diff would be suitable, and in order to apply it to my problem if it's suitable, I need a better understanding of the algorithm.
I understand what happens in steps 1 to 3, but I'm not clear on what happens in step 4. After steps 1 to 3, there remain blocks of unique lines with no possible match, plus the non-unique lines. So what happens next? Suppose there is no match among the remaining first/last lines of the documents; surely it doesn't just terminate (since there are no more unique lines)? Or does it compare every non-unique block in one document with every non-unique block in the other and somehow pick the best match?
http://bramcohen.livejournal.com/73318.html
1. Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.
2. Match the last lines of both if they're identical, then match the next-to-last, second-to-last, etc. until a pair doesn't match.
3. Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
4. Do steps 1-2 on each section between matched lines.
Once you've run out of unique lines you need to fall back to a different alignment algorithm. Git uses the standard diff algorithm at that point (Eugene Myers' O(ND) algorithm).
e.g., if the two files are:
a 12121 e 1212 b ee c x d
a 21212 e 2121 b ye c d
First, the patience algorithm aligns any lines that are unique and exist in both files:
a b c d
a b c d
Each subrange between those lines is then aligned recursively, first doing the patience algorithm again, then doing LCS algorithms if the patience algorithm doesn't match anything.
12121 e 1212 | ee | x
21212 e 2121 | ye |
In the first subrange, e is now unique on both so the second patience diff pass will align it, splitting that into two new subranges. The new first subrange (12121 vs 21212) doesn't have any unique lines, so it will be aligned with the LCS algorithm. The second new subrange (1212 vs 2121) is done with a second pass of the LCS algorithm.
The second grouping above (ee vs ye) doesn't have any unique lines, so they'll be aligned using the LCS algorithm as well.
The final grouping (x vs nothing) just outputs x as a delete, without doing either the patience or LCS algorithms.
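The recursion above bottoms out in a longest-increasing-subsequence pass over the unique common lines (step 3). Here is a minimal Python sketch of just that core step; the function name `unique_common_matches` and the toy inputs are mine, and real patience diff wraps the recursion and the LCS fallback around this:

```python
# Core of patience diff: align lines that are unique in both files by
# taking the longest subset that appears in the same order in both,
# found via patience sorting (a longest-increasing-subsequence pass).

from bisect import bisect_left
from collections import Counter

def unique_common_matches(a, b):
    """Return matched (index_in_a, index_in_b) pairs for unique common lines."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {line: j for j, line in enumerate(b)
             if ca[line] == 1 and cb[line] == 1}
    # Pairs (i, j) for unique common lines, ordered by position in `a`.
    pairs = [(i, pos_b[line]) for i, line in enumerate(a) if line in pos_b]
    # Patience sorting: an LIS of the b-indices is an LCS of the unique lines.
    tops, piles, prev = [], [], [-1] * len(pairs)
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tops, j)
        if k == len(tops):
            tops.append(j)
            piles.append(idx)
        else:
            tops[k] = j
            piles[k] = idx
        prev[idx] = piles[k - 1] if k > 0 else -1   # backpointer to prior pile
    out, idx = [], piles[-1] if piles else -1
    while idx != -1:
        out.append(pairs[idx])
        idx = prev[idx]
    return out[::-1]

left = "a x b y c".split()
right = "a b q c x".split()
print(unique_common_matches(left, right))   # a, b, c align; x would be out of order
```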
This is an interview question. I have K machines each of which is connected to 1 central machine. Each of the K machines have an array of 4 byte numbers in file. You can use any data structure to load those numbers into memory on those machines and they fit. Numbers are not unique across K machines. Find the K largest numbers in the union of the numbers across all K machines. What is the fastest I can do this?
(This is an interesting problem because it involves parallelism. As I haven't encountered parallel algorithm optimization before, it's quite amusing: you can get away with some ridiculously high-complexity steps, because you can make up for it later. Anyway, onto the answer...)
> "What is the fastest I can do this?"
The best you can do is O(K). Below I illustrate both a simple O(K log(K)) algorithm, and the more complex O(K) algorithm.
First step:
Each computer needs enough time to read every element. This means that unless the elements are already in memory, one of the two bounds on the time is O(largest array size). If for example your largest array size varies as O(K log(K)) or O(K^2) or something, no amount of algorithmic trickery will let you go faster than that. Thus the actual best running time is O(max(K, largestArraySize)) technically.
Let us say the arrays have a maximum length of N. With the above caveat, we may bound N <= K: each computer must look at each of its elements at least once (O(N) preprocessing per computer), and within that same O(N) it can pick out its own K largest elements (this is finding a kth order statistic, for which linear-time selection algorithms exist). That reduction therefore comes for free.
Bounds and reasonable expectations:
Let's begin by thinking of some worst-case scenarios, and estimates for the minimum amount of work necessary.
One minimum-work-necessary estimate is O(K*N/K) = O(N), because we need to look at every element at the very least. But, if we're smart, we can distribute the work evenly across all K computers (hence the division by K).
Another minimum-work-necessary estimate is O(N): if one computer's array holds elements larger than everything on all the other computers, we simply return that set.
We must output all K elements; this is at least O(K) to print them out. We can avoid this if we are content merely knowing where the elements are, in which case the O(K) bound does not necessarily apply.
Can this bound of O(N) be achieved? Let's see...
Simple approach - O(NlogN + K) = O(KlogK):
For now let's come up with a simple approach, which achieves O(NlogN + K).
Consider the data arranged like so, where each column is a computer, and each row is a number in the array:
computer: A B C D E F G
10 (o) (o)
9 o (o) (o)
8 o (o)
7 x x (x)
6 x x (x)
5 x ..........
4 x x ..
3 x x x . .
2 x x . .
1 x x .
0 x x .
You can also imagine this as a sweep-line algorithm from computational geometry, or as an efficient variant of the 'merge' step from mergesort. The parenthesized elements represent those with which we'll initialize our "candidate solution" (on some central server). The algorithm will converge on the correct o answers by swapping the (x) entries out in favour of the two unselected os.
Algorithm:
All computers start as 'active'.
Each computer sorts its elements. (parallel O(N logN))
Repeat until all computers are inactive:
Each active computer finds the next-highest element (O(1) since sorted) and gives it to the central server.
The server smartly combines the new elements with the old K elements, and removes an equal number of the lowest elements from the combined set. To perform this step efficiently, we have a global priority queue of fixed size K. We insert the new potentially-better elements, and bad elements fall out of the set. Whenever an element falls out of the set, we tell the computer which sent that element to never send another one. (Justification: This always raises the smallest element of the candidate set.)
(sidenote: Adding a callback hook to falling out of a priority queue is an O(1) operation.)
We can see graphically that this will perform at most 2K*(findNextHighest_time + queueInsert_time) operations, and as we do so, elements will naturally fall out of the priority queue. findNextHighest_time is O(1) since we sorted the arrays, so to minimize 2K*queueInsert_time, we choose a priority queue with an O(1) insertion time (e.g. a Fibonacci-heap based priority queue). This gives us an O(log(queue_size)) extraction time (we cannot have O(1) insertion and extraction); however, we never need to use the extract operation! Once we are done, we merely dump the priority queue as an unordered set, which takes O(queue_size)=O(K) time.
We'd thus have O(N log(N) + K) total running time (parallel sorting, followed by O(K)*O(1) priority queue insertions). In the worst case of N=K, this is O(K log(K)).
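A single-process Python sketch of this simple approach (the name `top_k_central` is made up; the "machines" are plain lists, the parallel sort is simulated with `sorted()`, and the "never send another element" callback is modeled by discarding the machine id when one of its elements falls out of the candidate heap):

```python
# Simulation of the simple approach: each machine sorts once, then streams
# its elements in decreasing order; the server keeps a size-K min-heap of
# candidates and deactivates a machine as soon as one of its elements is
# ejected (everything it would send later is smaller still).

import heapq

def top_k_central(machines, k):
    """machines: list of per-machine arrays; returns the k largest overall."""
    iters = [iter(sorted(m, reverse=True)) for m in machines]  # per-machine sort
    candidates = []                    # min-heap of (value, machine_id), size <= k
    active = set(range(len(machines)))
    while active:
        for mid in list(active):
            if mid not in active:      # deactivated earlier in this round
                continue
            v = next(iters[mid], None)
            if v is None:              # machine exhausted
                active.discard(mid)
                continue
            if len(candidates) < k:
                heapq.heappush(candidates, (v, mid))
            else:
                _, dropped_mid = heapq.heappushpop(candidates, (v, mid))
                active.discard(dropped_mid)   # that machine can never contribute again
    return sorted((v for v, _ in candidates), reverse=True)

print(top_k_central([[5, 1, 9], [8, 8, 2], [3, 7, 4]], 4))
```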
The better approach - O(N+K) = O(K):
However I have come up with a better approach, which achieves O(K). It is based on the median-of-median selection algorithm, but parallelized. It goes like this:
We can eliminate a set of numbers if we know for sure that there are at least K numbers elsewhere among all the computers that are at least as large.
Algorithm:
Each computer finds the sqrt(N)th highest element of its set, and splits the set into elements < and > it. This takes O(N) time in parallel.
The computers collaborate to combine those statistics into a new set, and find the K/sqrt(N)th highest element of that set (let's call it the 'superstatistic'), and note which computers have statistics < and > the superstatistic. This takes O(K) time.
Now consider all elements less than their computer's statistics, on computers whose statistic is less than the superstatistic. Those elements can be eliminated. This is because the elements greater than their computer's statistic, on computers whose statistic is larger than the superstatistic, are a set of K elements which are larger. (See the visual here).
Now, the computers with the uneliminated elements evenly redistribute their data to the computers who lost data.
Recurse: you still have K computers, but the value of N has decreased. Once N is less than a predetermined constant, use the previous algorithm I mentioned in "simple approach - O(NlogN + K)"; except in this case, it is now O(K). =)
It turns out that the reductions are O(N) total (amazingly, not order K), except perhaps the final step, which might be O(K). Thus this algorithm is O(N+K) = O(K) total.
Analysis and simulation of O(K) running time below. The statistics allow us to divide the world into four unordered sets, represented here as a rectangle divided into four subboxes:
------N-----
N^.5
________________
| | s | <- computer
| | #=K s REDIST. | <- computer
| | s | <- computer
| K/N^.5|-----S----------| <- computer
| | s | <- computer
K | s | <- computer
| | s ELIMIN. | <- computer
| | s | <- computer
| | s | <- computer
| |_____s__________| <- computer
LEGEND:
s=statistic, S=superstatistic
#=K -- set of K largest elements
(I'd draw the relation between the unordered sets of rows and the s-column here, but it would clutter things up; see the addendum.)
For this analysis, we will consider N as it decreases.
At a given step, we are able to eliminate the elements labelled ELIMIN; this removes area from the rectangle representation above, reducing the problem size from K*N to K*N - K(√N-1)^2, which hilariously simplifies to K(2√N-1).
Now, the computers with the uneliminated elements redistribute their data (the REDIST rectangle above) to the computers with eliminated elements (ELIMIN). This is done in parallel, where the bandwidth bottleneck corresponds to the short side of REDIST (because the REDIST computers are outnumbered by the ELIMIN computers waiting for their data). Therefore the transfer takes as long as the long side of the REDIST rectangle (another way of thinking about it: K/√N * (N-√N) is the area, divided by K/√N data-per-unit-time, giving O(N-√N) time).
Thus at each step of size N, we are able to reduce the problem size to K(2√N-1), at the cost of performing N + 3K + (N-√N) work. We now recurse. The recurrence relation which will tell us our performance is:
T(N) = 2N+3K-√N + T(2√N-1)
The decimation of the subproblem size is much faster than the normal geometric series (being √N rather than something like N/2 which you'd normally get from common divide-and-conquers). Unfortunately neither the Master Theorem nor the powerful Akra-Bazzi theorem work, but we can at least convince ourselves it is linear via a simulation:
>>> from math import sqrt
>>> def T(n,k=None):
...     return 1 if n<10 else sqrt(n)*(2*sqrt(n)-1)+3*k+T(2*sqrt(n)-1, k=k)
>>> f = (lambda x: x)
>>> (lambda n: T((10**5)*n,k=(10**5)*n)/f((10**5)*n) - T(n,k=n)/f(n))(10**30)
-3.552713678800501e-15
The function T(N) is, at large scales, a multiple of the linear function x, hence linear (doubling the input doubles the output). This method, therefore, almost certainly achieves the bound of O(N) we conjecture. Though see the addendum for an interesting possibility.
...
Addendum
One pitfall is accidentally sorting. If we do anything which accidentally sorts our elements, we will incur a log(N) penalty at the least. Thus it is better to think of the arrays as sets, to avoid the pitfall of thinking that they are sorted.
Also, we might initially think that the constant 3K of work at each step would force a total of about 3K*log(log(N)) work (since there are roughly log(log(N)) recursion levels). But the -1 has a powerful role to play in the decimation of the problem size. It is very slightly possible that the running time is actually something above linear, but definitely much smaller than even N*log(log(log(log(N)))). For example it might be something like O(N*InverseAckermann(N)), but I hit the recursion limit when testing.
The O(K) is probably only due to the fact that we have to print them out; if we are content merely knowing where the data is, we might even be able to pull off an O(N) (e.g. if the arrays are of length O(log(K)) we might be able to achieve O(log(K)))... but that's another story.
The relation between the unordered sets is as follows (it would have cluttered things up in the explanation above).
.
_
/ \
(.....) > s > (.....)
s
(.....) > s > (.....)
s
(.....) > s > (.....)
\_/
v
S
v
/ \
(.....) > s > (.....)
s
(.....) > s > (.....)
s
(.....) > s > (.....)
\_/
Find the k largest numbers on each machine. O(n*log(k))
Combine the results on a centralized server if k is not huge; otherwise, merge them in a tree hierarchy across the server cluster.
Update: to make it clear, the combine step is not a sort. You just pick the top k numbers from the results. There are many ways to do this efficiently. You can use a heap for example, pushing the head of each list. Then you can remove the head from the heap and push the head from the list the element belonged to. Doing this k times gives you the result. All this is O(k*log(k)).
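A minimal Python sketch of that combine step (the function name and inputs are illustrative; since `heapq` is a min-heap, values are negated so the largest pops first):

```python
# Combine step: each machine contributes its local top-k in descending
# order; a heap over the list heads yields the global top k in O(k log k).

import heapq

def combine_top_k(sorted_lists, k):
    """sorted_lists: descending lists (each machine's local top k)."""
    heap = [(-lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)      # largest remaining head
        out.append(-neg)
        if j + 1 < len(sorted_lists[i]):     # push next element from that list
            heapq.heappush(heap, (-sorted_lists[i][j + 1], i, j + 1))
    return out

print(combine_top_k([[9, 5, 1], [8, 8, 2], [7, 4, 3]], 4))
```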
Maintain a min-heap of size k on the centralized server.
Initially insert the first k elements into the min-heap.
For each remaining element:
Peek at the minimum element in the heap (O(1)).
If that minimum is less than the current element, remove it from the heap and insert the current element.
Finally the min-heap holds the k largest elements.
This requires O(n log k) time.
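These steps can be sketched in a few lines of Python (names are mine, for illustration):

```python
# Streaming variant: keep a min-heap of size k and offer it every element;
# the heap root is always the current k-th largest seen so far.

import heapq

def stream_top_k(stream, k):
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:               # beats the current k-th largest
            heapq.heapreplace(heap, x)  # pop min, push x in one O(log k) step
    return sorted(heap, reverse=True)

print(stream_top_k([5, 1, 9, 8, 8, 2, 3, 7, 4], 3))
```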
I would suggest something like this:
take the k largest numbers on each machine, in sorted order: O(Nk), where N is the number of elements on each machine
sort each of these k-element arrays by largest element (you get k arrays of k elements, each sorted by largest element: a square k x k matrix)
take the "upper triangle" of the matrix made of these k arrays (the k largest elements must lie in this upper triangle)
the central machine can now find the k largest of these k(k+1)/2 elements
1. Let each machine find its k largest elements, copy them into a data structure (a stack), sort it, and pass it on to the central machine.
2. At the central machine, receive the stacks from all the machines and find the greatest of the elements at the tops of the stacks.
3. Pop that greatest element off its stack and copy it to the 'Top K' list, leaving the other stacks intact.
4. Repeat steps 2 and 3, k times, to get the top K numbers.
1) Sort the items on every machine.
2) Use a k-element binary heap on the central machine:
a) populate the heap with the first (maximum) element from each machine;
b) extract the top element, then push onto the heap the next element from the machine the extracted element came from (re-heapifying after each insertion).
The sort is O(N log N), where N is the size of the largest array on a machine.
O(k) to build the heap.
O(k log k) to extract from and refill the heap k times.
Overall complexity: max(O(k log k), O(N log N)).
I would think the MapReduce paradigm would be well suited to a task like this.
Every machine runs its own independent map task to find the maximum value in its array (how depends on the language used), and this will probably be O(N) for N numbers on each machine.
The reduce task compares the result from the individual machines' outputs to give you the largest k numbers.
I am looking for the fastest way to solve the following problem:
Given some volume of lattice points in a 3D grid, some points b_i (the boundary) satisfy f(b_i)=0, while another point a_0 satisfies f(a_0)= 1.
All other points (non-boundary) are some linear combination of the surrounding 26 points. For example, I could want
f(i, j, k)= .05*f(i+1, j, k)+.1*f(i, j+1, k)+.07*f(i, j, k+1)+...
The sum of the coefficients .05+.1+.07+... will add up to 1. My objective is to solve for f(x_i) for all x_i in the volume.
Currently, I am using the successive over-relaxation (SOR) method, which basically initializes the boundary of the volume, assigns to each point the weighted average of the 26 surrounding points, and repeats. The SOR step then takes a weighted combination of f(x_i) from the most recent iteration and f(x_i) from the iteration before.
I was wondering if anyone knows of any faster ways to solve this problem for a 3D grid around the size 102x102x48. SOR currently takes about 500-1000 iterations to converge to the level I want (varying depending on the coefficients used). I am most willing to use matlab, idl, and c++. Does anyone have any idea of how fast SOR is compared to converting the problem into a linear system and using matrix methods (like BCGSTAB)? Also, which method would be most effectively (and easily) parallelized? I have access to a 250 core cluster, and have been trying to make the SOR method parallel using mpi and c++, but haven't seen as much speed increase as I would like (ideal would be on the order of 100-fold). I would greatly appreciate any ideas on speeding up the solution to this problem. Thanks.
If you're comfortable with multithreading, using a red-black scheme for SOR can give a decent speedup. For a 2D problem, imagine a checkerboard - the red squares only depend on the black squares (and possibly themselves), so you can update all the red squares in parallel, and then repeat for all the black ones. Note that this does converge more slowly than the simple ordering, but it lets you spread the work over multiple threads.
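Here is a minimal pure-Python sketch of the red-black idea on a tiny 2D Laplace grid (the 6x6 size, omega=1.5, iteration count, and top-edge boundary value are arbitrary choices for illustration; in a real solver each colour sweep could be farmed out to threads or vectorized, since same-colour points are independent):

```python
# Red-black SOR on a 2D Laplace problem: points where (i + j) is even
# ("red") depend only on odd ("black") neighbours, so all points of one
# colour can be updated in parallel. omega is the over-relaxation factor.

def red_black_sor(u, omega=1.5, iters=200):
    """u: list of lists with boundary rows/cols fixed; relaxes interior in place."""
    n, m = len(u), len(u[0])
    for _ in range(iters):
        for parity in (0, 1):                  # red sweep, then black sweep
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != parity:
                        continue
                    avg = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])
                    u[i][j] += omega * (avg - u[i][j])
    return u

u = [[0.0] * 6 for _ in range(6)]
u[0] = [1.0] * 6      # hypothetical boundary condition: top edge held at 1
red_black_sor(u)
print(u[3][3])        # interior values settle between the boundary values
```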
Conjugate gradient methods will generally converge faster than SOR (if I remember correctly, by about an order of magnitude). I've never used BCGSTAB, but I remember GMRES working well on non-symmetric problems, and they can probably both benefit from preconditioning.
As for opportunities for parallelization, most CG-type methods only need you to compute the matrix-vector product A*x, so you never need to form the full matrix. That's probably the biggest cost of each iteration, and so it's where you should look at multithreading.
Two good references on this are Golub and Van Loan, and Trefethen and Bau. The latter is much more readable IMHO.
Hope that helps...
I've got (I think) an exact solution; however, it might be slow. The main cost would be computation on a very, very big (and hopefully sparse) matrix.
Let's say you have N points in your volume; call them p1...pN. What you are looking for is f(p1)...f(pN).
Take a point pi. It has 26 neighbours, pi-1...pi-26. I assume you have access, for each point pi, to the coefficients of its linear combination; call them ci-1...ci-26 (ci-j being the coefficient linking pi to its j-th neighbour).
Better still, you can consider pi a linear combination of all the points in your space, where all but 26 of the coefficients are zero. That gives you coefficients ci-1...ci-N.
You can now build a big N*N matrix of these ci-j coefficients:
+-----+------+------+-----+------+
|  0  | c1-2 | c1-3 | ... | c1-N |   |f(p1)|   |f(p1)|
+-----+------+------+-----+------+   |     |   |     |
| c2-1|  0   | c2-3 | ... | c2-N |   |f(p2)|   |f(p2)|
+-----+------+------+-----+------+ * |  .  | = |  .  |
|  .  |  .   |  .   |  .  |  .   |   |  .  |   |  .  |
+-----+------+------+-----+------+   |  .  |   |  .  |
| cN-1| cN-2 | cN-3 | ... |  0   |   |f(pN)|   |f(pN)|
+-----+------+------+-----+------+
Amazing! The solution you're looking for is one of the eigenvectors corresponding to the eigenvalue 1!
Use an optimised matrix library that computes eigenvectors efficiently (for sparse matrix) and hope it is already parallelised!
Edit: Amusingly, I just reread your post; it seems I just gave you your BCGSTAB solution... sorry! ^^
Re-edit: In fact, I'm not sure; are you talking about the "Biconjugate Gradient Stabilized Method"? Because I don't see the linear method you're talking about if you're doing gradient descent...
Re-re-edit: I love math =) I can even prove that, given the condition that the ci sum to 1, 1 is in fact an eigenvalue. I do not know the dimension of the corresponding eigenspace, but it is at least 1!
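A tiny numeric check of that last claim (the 3x3 weights below are hypothetical): if every row's coefficients sum to 1, then each entry of M times the all-ones vector is just that row's coefficient sum, so the all-ones vector is an eigenvector with eigenvalue 1.

```python
# If every row of M sums to 1, then M @ ones == ones, i.e. 1 is an
# eigenvalue with the all-ones vector as an eigenvector.

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

# Toy 3-point system: zero diagonal, off-diagonal weights summing to 1.
M = [[0.0, 0.4, 0.6],
     [0.5, 0.0, 0.5],
     [0.3, 0.7, 0.0]]

ones = [1.0, 1.0, 1.0]
print(matvec(M, ones))   # each entry equals its row sum (1, up to rounding)
```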