How to understand the "Relaxing the monotonicity assumption" - virtual-memory

CSAPP 9.9.3 Allocator Requirements and Goals
we use "peak utilization" to characterize how efficiently an allocator uses the heap. We are given some sequence of n allocate and free requests "R0, R1, R2....". If an application requests a block of p bytes , then the resulting allocated block has a payload of p bytes. After request R_k has completed, let the aggregated payload, denoted P_k, be the sum of the payloads of the currently allocated blocks, and let H_k denote the current(monotonically nondecreasing) size of the heap.
Then the peak utilization over the first k+1 requests , denoted by U_k, is given by U_k = max(Pi)/H_k ( i <= k)
"we could relax the monotonically nondecreasing assumption in our definition of U_k ans allow the heap to grow up and down by letting H_k be the high-water mark over the first k+1 requests."
How to understand "allow the heap to grow up and down by letting H_k be the high-water mark over the first k+1 requests"?

Related

Counting sort in O(1) space

I have the following counting sort, but its space complexity is too high for me; I'm looking for a way to do it with O(1) space complexity.
MyCountingSort(A, B, k)
    for i = 0 to k
        do G[i] = 0
    for j = 1 to length[A]
        do G[A[j]] = G[A[j]] + 1
    for i = 1 to k
        do G[i] = G[i] + G[i-1]
    for j = length[A] downto 1
        do B[G[A[j]]] = A[j]
           G[A[j]] = G[A[j]] - 1
Currently, the algorithm is allocating O(k) space.
Assuming k <= A.length, how can I improve the algorithm's space complexity to O(1)?
I'm assuming here that A is your input array and B is your output array; thus, |A| = |B|. I'm further assuming that k is the maximum number of values we might encounter (for instance, if A contains only numbers from 1 to k or from 0 to k-1). It would help us if you specified this kind of detail when asking a question, but I'm guessing that this is more or less what you are asking. :)
Since we have the very convenient additional constraint that k <= |A|, we can use our given arrays A and B as intermediate storage for our index array. Essentially, make B your G in your code and perform the 1st and 2nd loop on it. Then we make the cumulative additions (3rd loop).
Once we have finished this, we can copy B back to A. Finally, we overwrite B with our final sorted array (4th loop in your code).
This way, no memory is allocated apart from the input parameters already given. In general, the space complexity of an algorithm is defined as independent of the input of the algorithm. Since we are only recycling the input arrays and not allocating anything ourselves, this algorithm is indeed of O(1) space complexity.
Notice that in the general case (where k is not necessarily <= |A|), it will not be this easy. In addition, it is only because the output array B has already been provided to us as an input that we can make use of this "trick" of using it for our internal use and thus not have to allocate any new memory.
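To make the idea concrete, here is a minimal Python sketch under the assumptions above (values are integers in [0, k) with k <= |A|, and B has the same length as A). Note that this variant only recovers the sorted keys: once the counts live in B, the original values in A can be overwritten, so there is no stable 4th loop carrying satellite data. The function name and exact loop structure are mine, not the asker's:

def counting_sort_reuse_B(A, B, k):
    """Counting sort with O(1) extra space by reusing the caller-provided arrays.
    Assumes 0 <= A[j] < k <= len(A), len(B) == len(A), and that only the
    sorted keys are needed (no satellite data)."""
    n = len(A)
    # 1st and 2nd loops: use B as the count array G.
    for i in range(k):
        B[i] = 0
    for j in range(n):
        B[A[j]] += 1
    # The input values are no longer needed, so expand the counts into A ...
    out = 0
    for v in range(k):
        for _ in range(B[v]):
            A[out] = v
            out += 1
    # ... and copy the sorted result into the output array B.
    for j in range(n):
        B[j] = A[j]
    return B

Only a constant number of extra variables are used, so apart from the arrays the caller already supplied, the extra space is O(1).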

Parity of permutation with parallelism

I have an integer array of length N containing the values 0, 1, 2, .... (N-1), representing a permutation of integer indexes.
What's the most efficient way to determine whether the permutation has odd or even parity, given that I also have O(N) parallel compute available?
For example, you can sum N numbers in O(log N) time with parallel computation. I expect to find the parity of a permutation in O(log N) as well, but I cannot seem to find an algorithm. I also do not know what this "complexity order with parallel computation" is called.
The number in each array slot is the proper slot for that item. Think of it as a direct link from the "from" slot to the "to" slot. An array like this is very easy to sort in O(N) time with a single CPU just by following the links, so it would be a shame to have to use a generic sorting algorithm to solve this problem. Thankfully...
You can do this easily in O(log N) time with Ω(N) CPUs.
Let A be your array. Since each array slot has a single link out (the number in that slot) and a single link in (that slot's number is in some slot), the links break down into some number of cycles.
The parity of the permutation is the oddness of N-m, where N is the length of the array and m is the number of cycles, so we can get your answer by counting the cycles.
First, make an array S of length N, and set S[i] = i.
Then:
Repeat ceil(log_2(N)) times:
    foreach i in [0, N), in parallel:
        if S[i] < S[A[i]] then:
            S[A[i]] = S[i]
        A[i] = A[A[i]]
When this is finished, every S[i] will contain the smallest index in the cycle containing i. The first pass of the inner loop propagates the smallest S[i] to the next slot in the cycle by following the link in A[i]. Then each link is made twice as long, so the next pass will propagate it to 2 new slots, etc. It takes at most ceil(log_2(N)) passes to propagate the smallest S[i] around the cycle.
Let's call the smallest slot in each cycle the cycle's "leader". The number of leaders is the number of cycles. We can find the leaders just like this:
foreach i in [0, N), in parallel:
    if S[i] == i then:
        S[i] = 1   // leader
    else:
        S[i] = 0   // not leader
Finally, we can just add up the elements of S to get the number of cycles in the permutation, from which we can easily calculate its parity.
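Here is a sequential Python simulation of this scheme (my own sketch: the snapshot copies stand in for the simultaneous reads of a parallel round, so each pass costs O(N) work here rather than O(1) span):

import math

def permutation_parity(perm):
    """Parity of a permutation perm (perm[i] is the image of i): 0 = even, 1 = odd.
    Simulates the pointer-jumping scheme described above."""
    n = len(perm)
    A = list(perm)             # jump links, doubled every pass
    S = list(range(n))         # S[i] converges to the smallest index in i's cycle
    passes = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(passes):
        old_S, old_A = S[:], A[:]     # "parallel" reads come from these snapshots
        for i in range(n):
            if old_S[i] < old_S[old_A[i]]:
                # propagate the smaller label; concurrent writes resolved by min
                S[old_A[i]] = min(S[old_A[i]], old_S[i])
            A[i] = old_A[old_A[i]]    # double the link length
    leaders = sum(1 for i in range(n) if S[i] == i)   # one leader per cycle
    return (n - leaders) % 2

For example, permutation_parity([1, 2, 0]) returns 0 (a 3-cycle is even) and permutation_parity([1, 0, 2]) returns 1.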
You didn't specify a machine model, so I'll assume that we're working with an EREW PRAM. The complexity measure you care about is called "span", the number of rounds the computation takes. There is also "work" (number of operations, summed over all processors) and "cost" (span times number of processors).
From the point of view of theory, the obvious answer is to modify an O(log n)-depth sorting network (AKS or Goodrich's Zigzag Sort) to count swaps, then return (number of swaps) mod 2. The code is very complex, and the constant factors are quite large.
A more practical algorithm is to use Batcher's bitonic sorting network instead, which raises the span to O(log² n) but has reasonable constant factors (such that people actually use it in practice to sort on GPUs).
I can't think of a practical deterministic algorithm with span O(log n), but here's a randomized algorithm with span O(log n) with high probability. Assume n processors and let the (modifiable) input be Perm. Let Coin be an array of n Booleans.
In each of O(log n) passes, the processors do the following in parallel, where i ∈ {0…n-1} identifies the processor, and swaps ← 0 initially. Lower case variables denote processor-local variables.
Coin[i] ← true with probability 1/2, false with probability 1/2
(barrier synchronization required in asynchronous models)
if Coin[i]
    j ← Perm[i]
    if not Coin[j]
        Perm[i] ← Perm[j]
        Perm[j] ← j
        swaps ← swaps + 1
    end if
end if
(barrier synchronization required in asynchronous models)
Afterwards, we sum up the local values of swaps and mod by 2.
Each pass reduces the number of i such that Perm[i] ≠ i by 1/4 of the current total in expectation. Thanks to the linearity of expectation, the expected total is at most n(3/4)^r, so after r = 2 log_{4/3} n = O(log n) passes, the expected total is at most 1/n, which in turn bounds the probability that the algorithm has not converged to the identity permutation as required. On failure, we can just switch to the O(n)-span serial algorithm without blowing up the expected span, or just try again.
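A plain sequential Python rendering of this randomized procedure (illustrative only: the per-pass parallelism is replaced by an ordinary loop, and instead of a serial fallback on failure it simply keeps taking passes until the permutation becomes the identity):

import random

def parity_randomized(perm):
    """Parity of a permutation perm (perm[i] is the image of i): 0 = even, 1 = odd.
    Each executed swap composes a transposition with the permutation, so once the
    permutation has been reduced to the identity, the parity of the swap count
    equals the parity of the original permutation."""
    perm = list(perm)
    n = len(perm)
    swaps = 0
    while any(perm[i] != i for i in range(n)):
        coin = [random.random() < 0.5 for _ in range(n)]
        for i in range(n):
            j = perm[i]
            if coin[i] and not coin[j]:
                # apply the transposition that turns position j into a fixed point
                perm[i], perm[j] = perm[j], j
                swaps += 1
    return swaps % 2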

Select some sets, and union them together to form main set, in a way that minimizes the cost

Definition
Set P = {e1, e2, ..., en}: P has n different elements, enumerated as the ei's.
Each set I = {e1', e2', ...} has at least one element in common with P. The number of elements in I need not equal the number of elements in P.
Each I has a weight Q associated with it that describes the cost of using it, with Q > 0.
You have to help me design an algorithm that takes a set P as input, together with some k sets I, denoted I1, I2, ..., Ik, and exactly k Q values, denoted Q1, Q2, ..., Qk, where Q1 denotes the cost of using set I1, and so on.
You have to choose some of the I's, say I1, I2, ..., such that when they are all unioned together they produce a set P' with P a subset of P'.
Notice that once you find a selection of I's, it has a cost associated with it.
You also have to make sure that this cost is as SMALL as possible.
Input
input: one set P
input: a list of sets I, IList = {I1, I2, ..., Ik}
input: a list of costs Q, QList = {Q1, Q2, ..., Qk}
Ix and Qx correspond to each other one by one.
Output
P' = Ia U Ib U ... U Iz
P ⊆ P'
Make Qa + Qb + ... + Qz as small as possible.
Also mention the Time and Space Complexity of your algorithm
Sample Input
P={a,b,c}
I1={a,x,y,z} Q1=0.7
I2={b,c,x} Q2=1
I3={b,x,y,z} Q3=2
I4={c,y} Q4=3
I5={a,b,c,y} Q5=9
Sample Output
P1 = I1 U I2 COST=Q1+Q2=1.7
P2 = I1 U I3 U I4 COST=Q1+Q3+Q4=5.7
P3 = I5 COST=Q5=9
And: P ⊆ P1, P ⊆ P2, P ⊆ P3.
The costs: 1.7 < 5.7 < 9.
And then what we want is:
P1 = I1 U I2 COST=Q1+Q2=1.7
Here is a suggestion to simplify the problem.
We first duplicate all the I sets; let's call them I1', I2', ....
Now, the first job we should do is to remove the unwanted elements from the duplicated I' sets. Here, unwanted means elements that will not contribute towards the main set P.
We discard all those I' sets which do not contain even a single element of P.
Now, supposing P has n elements, we know for certain that the I' sets are nothing but subsets of the main set, and every subset has a cost Qi associated with it.
We just have to pick some subsets such that together they cover the main set,
subject to minimum cost.
We will denote the main set and subsets using bit based notation.
If the set P has n elements in it, we will have n bits in the representation.
So the main set will be denoted by <1,1,...1> (n 1's).
And its subsets will be denoted by bitsets having some of the 1's absent from the bitset of the main set. Because the I's are also subsets, they too have a binary representation denoting the subset they represent.
To solve the problem efficiently, let's assume that there is so much memory available that, if a bitset is treated as a binary number, we can use it to index a memory location in constant time.
This means that if we have, say, n = 4, all the subsets can be represented
by different values from 0 to 15 (see their binary representations, from 0000 (empty set) to 1111 (main set); when the element at position i of the main array is present in a subset, we put a 1 at that position in the bitset). Similarly when n is larger.
Now, with the bitset notation for sets, the union of two sets denoted by bitsets b1 and b2 is denoted by b1 | b2, where | is the bitwise OR operation.
Of course, we will not require so many memory locations, as not all subsets of the parent set will be present among the I's.
Algorithm :
The algorithmic idea used here is bitset based Dynamic Programming.
Assume we have a big array, namely COST, where COST[j] represents the cost to obtain the subset represented by the bitset j.
To start with the algorithm, we first put the cost to choose given subsets (in terms of I's), in their respective indices in COST array, and at all the other locations we put a very large value, say INF.
What we have to do is, to fill the array appropriately, and then once it is filled properly, we will get the answer to minimum cost by looking at the value COST[k] where k has all bits set, in binary representation.
Now we will focus on how to fill the array properly.
This is a rather easy task: we iterate over the COST array K times, where K is the number of I' sets we have.
For every I' set, let's call its binary representation BI'.
We OR BI' with the current index (idx); what we get is the bitset BS' of the new set S', which is the UNION of the set represented by the current index and the set represented by BI'.
We look at COST[BS'], and if we see that it is larger than COST[BI'] + COST[idx], we update the value at COST[BS'].
We proceed in a similar way, and at the end of the run we get the minimum cost at COST[BP], where BP is the bitset for P.
In order to track the participating I's, i.e., the ones that actually contributed to the formation of P, we can take a note while updating any index.
TIME COMPLEXITY : O(2^n * K), where K is the no. of I sets, and n is the no. of elements in P.
Space Complexity : O(2^n)
NOTE: Because of the assumption that the bit representations are directly indexable, the solution may not be very feasible for large values of n and k.
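For concreteness, here is a Python sketch of this COST-array idea (names and the backtracking bookkeeping are mine; as the note says, the 2^n table limits it to small n):

from math import inf

def min_cover_cost(P, sets, costs):
    """Weighted set cover by bitmask DP over subsets of P.
    P: list of distinct elements; sets: list of candidate I sets;
    costs: parallel list of positive costs Q.
    Returns (best_cost, chosen_indices), or (inf, []) if P cannot be covered."""
    index = {e: i for i, e in enumerate(P)}
    n = len(P)
    full = (1 << n) - 1

    # Bitmask of each I restricted to elements of P (other elements are irrelevant).
    masks = []
    for s in sets:
        m = 0
        for e in s:
            if e in index:
                m |= 1 << index[e]
        masks.append(m)

    cost = [inf] * (1 << n)       # cost[mask] = cheapest way found to cover mask
    choice = [None] * (1 << n)    # (previous mask, set index) for reconstruction
    cost[0] = 0
    for mask in range(1 << n):
        if cost[mask] == inf:
            continue
        for i, (m, q) in enumerate(zip(masks, costs)):
            new = mask | m
            if cost[mask] + q < cost[new]:    # q > 0, so new > mask on any improvement
                cost[new] = cost[mask] + q
                choice[new] = (mask, i)

    if cost[full] == inf:
        return inf, []
    chosen, mask = [], full
    while mask:
        prev, i = choice[mask]
        chosen.append(i)
        mask = prev
    return cost[full], sorted(chosen)

On the sample input above it selects I1 and I2 (indices 0 and 1) at cost Q1 + Q2 = 1.7. The table has 2^n entries and each is relaxed against all K sets, matching the stated O(2^n * K) time and O(2^n) space.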

In a looping linked list, what guarantee is there that the fast and slow runners will collide? [duplicate]

I had a look at questions that already talk about algorithms to find a loop in a linked list. I have read about Floyd's cycle-finding algorithm, mentioned in a lot of places, in which we take two pointers. One pointer (slower/tortoise) is incremented by one, and the other pointer (faster/hare) is incremented by 2. When they are equal we have found the loop, and if the faster pointer reaches null there is no loop in the linked list.
Now my question is: why do we increment the faster pointer by 2? Why not something else? Is incrementing by 2 necessary, or could we increment it by X and still get the result? Is it guaranteed that we will find a loop if we increment the faster pointer by 2, or could there be a case where we need to increment by 3 or 5 or x?
From a correctness perspective, there is no reason that you need to use the number two. Any choice of step size will work (except for one, of course). However, choosing a step of size two maximizes efficiency.
To see this, let's take a look at why Floyd's algorithm works in the first place. The idea is to think about the sequence x_0, x_1, x_2, ..., x_n, ... of the elements of the linked list that you'll visit if you start at the beginning of the list and then keep on walking down it until you reach the end. If the list does not contain a cycle, then all these values are distinct. If it does contain a cycle, though, then this sequence will repeat endlessly.
Here's the theorem that makes Floyd's algorithm work:
The linked list contains a cycle if and only if there is a positive integer j such that, for any positive integer k, x_j = x_{jk}.
Let's go prove this; it's not that hard. For the "if" case, if such a j exists, pick k = 2. Then we have that for some positive j, x_j = x_{2j} and j ≠ 2j, and so the list contains a cycle.
For the other direction, assume that the list contains a cycle of length l starting at position s. Let j be the smallest multiple of l greater than s. Then for any k, if we consider x_j and x_{jk}, since j is a multiple of the loop length, we can think of x_{jk} as the element formed by starting at position j in the list, then taking j steps k-1 times. But each time you take j steps, you end up right back where you started in the list, because j is a multiple of the loop length. Consequently, x_j = x_{jk}.
This proof guarantees you that if you take any constant number of steps on each iteration, you will indeed hit the slow pointer. More precisely, if you're taking k steps on each iteration, then you will eventually find the points x_j and x_{kj} and will detect the cycle. Intuitively, people tend to pick k = 2 to minimize the runtime, since you take the fewest number of steps on each iteration.
We can analyze the runtime more formally as follows. If the list does not contain a cycle, then the fast pointer will hit the end of the list after n steps, for O(n) time, where n is the number of elements in the list. Otherwise, the two pointers will meet after the slow pointer has taken j steps. Remember that j is the smallest multiple of l greater than s. If s ≤ l, then j = l; otherwise, if s > l, then j is at most 2s, so the value of j is O(s + l). Since l and s can be no greater than the number of elements in the list, this means that j = O(n). However, after the slow pointer has taken j steps, the fast pointer will have taken k steps for each of the j steps taken by the slow pointer, so it will have taken O(kj) steps. Since j = O(n), the net runtime is at most O(nk). Notice that this says that the more steps we take with the fast pointer, the longer the algorithm takes to finish (though only proportionally so). Picking k = 2 thus minimizes the overall runtime of the algorithm.
Hope this helps!
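For reference, here is a small Python sketch of the detection loop with a configurable fast step, matching the setting of the analysis above, where both pointers start at the head of the list (the Node class and names are illustrative):

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def has_cycle(head, fast_step=2):
    """Floyd-style cycle detection where the fast pointer advances fast_step
    nodes per iteration (fast_step >= 2) and the slow pointer advances one.
    Returns True iff the list contains a cycle."""
    slow = fast = head
    while True:
        for _ in range(fast_step):      # advance the fast pointer
            if fast is None:
                return False            # fell off the end: no cycle
            fast = fast.next
        slow = slow.next                # advance the slow pointer by one
        if slow is fast:
            return True

Any fast_step >= 2 will eventually detect a cycle here, but as argued above, the total number of pointer advances grows with the step size, which is why 2 is the usual choice.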
Let us suppose the length of the part of the list that does not contain the loop is s, the length of the loop is t, and the ratio of fast-pointer speed to slow-pointer speed is k.
Let the two pointers meet at a distance j from the start of the loop.
So, the distance slow pointer travels = s + j. Distance the fast pointer travels = s + j + m * t (where m is the number of times the fast pointer has completed the loop). But, the fast pointer would also have traveled a distance k * (s + j) (k times the distance of the slow pointer).
Therefore, we get k * (s + j) = s + j + m * t.
s + j = (m / (k - 1)) * t.
Hence, from the above equation, the length the slow pointer travels is an integer multiple of the loop length.
For greatest efficiency, (m / (k - 1)) = 1 (the slow pointer shouldn't have traveled the loop more than once).
Therefore, m = k - 1 => k = m + 1.
Since m is the number of times the fast pointer has completed the loop, m >= 1.
For greatest efficiency, m = 1;
therefore k = 2.
If we take a value of k > 2, the two pointers would have to travel a greater distance.
Hope the above explanation helps.
Consider a cycle of size L, meaning the loop starts at the k-th element: x_k -> x_{k+1} -> ... -> x_{k+L-1} -> x_k. Suppose one pointer runs at rate r1 = 1 and the other at rate r2. When the first pointer reaches x_k, the second pointer will already be in the loop at some element x_{k+s}, where 0 <= s < L. After m further pointer increments, the first pointer is at x_{k+(m mod L)} and the second pointer is at x_{k+((m*r2+s) mod L)}. Therefore the condition that the two pointers collide can be phrased as the existence of an m satisfying the congruence
m ≡ m*r2 + s (mod L)
This can be simplified with the following steps:
m(1 - r2) ≡ s (mod L)
m(L + 1 - r2) ≡ s (mod L)
This is of the form of a linear congruence. It has a solution m if s is divisible by gcd(L+1-r2,L). This will certainly be the case if gcd(L+1-r2,L)=1. If r2=2 then gcd(L+1-r2,L)=gcd(L-1,L)=1 and a solution m always exists.
Thus r2=2 has the good property that for any cycle size L, it satisfies gcd(L+1-r2,L)=1 and thus guarantees that the pointers will eventually collide even if the two pointers start at different locations. Other values of r2 do not have this property.
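A quick way to convince yourself of this is to simulate the pure-cycle setting and compare against the gcd condition (an illustrative sketch, not part of the original answer):

from math import gcd

def collides(L, r2, s):
    """Cycle of length L; slow pointer enters at offset 0, fast pointer is already
    s steps ahead; speeds are 1 and r2. Returns True if they ever meet on a node."""
    slow, fast = 0, s % L
    for _ in range(L):            # the pair of positions repeats with period at most L
        if slow == fast:
            return True
        slow = (slow + 1) % L
        fast = (fast + r2) % L
    return False

# The congruence predicts a collision for every starting offset s exactly when
# gcd(L + 1 - r2, L) = gcd(r2 - 1, L) == 1; r2 = 2 always satisfies this.
for L in range(2, 30):
    for r2 in range(2, 10):
        predicted = gcd(r2 - 1, L) == 1
        observed = all(collides(L, r2, s) for s in range(L))
        assert predicted == observed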
If the fast pointer moves 3 steps and slow pointer at 1 step, it is not guaranteed for both pointers to meet in cycles containing even number of nodes. If the slow pointer moved at 2 steps, however, the meeting would be guaranteed.
In general, if the hare moves at H steps, and tortoise moves at T steps, you are guaranteed to meet in a cycle iff H = T + 1.
Consider the hare moving relative to the tortoise.
Hare's speed relative to the tortoise is H - T nodes per iteration.
Given a cycle of length N = (H - T) * k, where k is any positive integer, the hare would skip every H - T - 1 nodes (again, relative to the tortoise), and it would be impossible for them to meet if the tortoise was in any of those nodes.
The only possibility where a meeting is guaranteed is when H - T - 1 = 0.
Hence, increasing the fast pointer by x is allowed, as long as the slow pointer is increased by x - 1.
Here is an intuitive, non-mathematical way to understand this:
If the fast pointer runs off the end of the list obviously there is no cycle.
Ignore the initial part where the pointers are in the initial non-cycle part of the list, we just need to get them into the cycle. It doesn't matter where in the cycle the fast pointer is when the slow pointer finally reaches the cycle.
Once they are both in the cycle, they are circling the cycle but at different points. Imagine if they were both moving by one each time: they would be circling the cycle but staying the same distance apart; in other words, making the same loop but out of phase. Now, by moving the fast pointer by two each step, they change their phase with each other, decreasing their distance apart by one each step. The fast pointer will catch up to the slow pointer and we can detect the loop.
To prove this is true, that they will meet each other and the fast pointer will not somehow overtake and skip over the slow pointer just hand simulate what happens when the fast pointer is three steps behind the slow, then simulate what happens when the fast pointer is two steps behind the slow, then when the fast pointer is just one step behind the slow pointer. In every case they meet at the same node. Any larger distance will eventually become a distance of three, two or one.
If there is a loop (of n nodes), then once a pointer has entered the loop it will remain there forever; so we can move forward in time until both pointers are in the loop. From here on the pointers can be represented by integers modulo n with initial values a and b. The condition for them to meet after t steps is then
a + t ≡ b + 2t (mod n),
which has the solution t ≡ a − b (mod n).
This will work so long as the difference between the speeds shares no prime factors with n.
Reference
https://math.stackexchange.com/questions/412876/proof-of-the-2-pointer-method-for-finding-a-linked-list-loop
The single restriction on speeds is that their difference should be co-prime with the loop's length.
Theoretically, consider the cycle (loop) as a park (circular, rectangular, whatever). Person X moves slowly, and person Y moves faster than X. Now, it doesn't matter whether Y moves at 2 times the speed of X, or 3, 4, 5... times: there will always be a point where they meet.
Say we use two references Rp and Rq which take p and q steps in each iteration; p > q. In the Floyd's algorithm, p = 2, q = 1.
We know that after certain iterations, both Rp and Rq will be at some elements of the loop. Then, say Rp is ahead of Rq by x steps. That is, starting at the element of Rq, we can take x steps to reach the element of Rp.
Say, the loop has n elements. After t further iterations, Rp will be ahead of Rq by (x + (p-q)*t) steps. So, they can meet after t iterations only if:
n divides (x + (p-q)*t)
Which can be written as:
(p−q)*t ≡ (−x) (mod n)
Due to modular arithmetic, this is possible only if: GCD(p−q, n) | x.
But we do not know x. Though, if the GCD is 1, it will divide any x. To make the GCD as 1:
if n is not known, choose any p and q such that (p-q) = 1. Floyd's algorithm does have p-q = 2-1 = 1.
if n is known, choose any p and q such that (p-q) is coprime with n.
Update: on some further analysis later, I realized that any unequal positive integers p and q will make the two references meet after some iterations. Though, the values of 1 and 2 seem to require the smallest total number of steps.
The reason why 2 is chosen is this: let's say
the slow pointer moves at 1,
the fast pointer moves at 2,
and the loop has 5 elements.
Now, for the slow and fast pointers to meet,
the lowest common multiple (LCM) of 1, 2 and 5 must exist, and that's where they meet. In this case it's 10.
If you simulate the slow and fast pointers, you will see that they meet at 2 * (number of elements in the loop). When you do 2 loops, you meet at exactly the same point as the starting point.
In the case of no loop, it becomes the LCM of 1, 2 and infinity, so they never meet.
If the linked list has a loop, then a fast pointer with an increment of 2 will work better than, say, an increment of 3 or 4 or more, because it ensures that once we are inside the loop the pointers will surely collide and there will be no overtaking.
For example, if we take an increment of 3, then inside the loop let's assume
fast pointer --> i
slow --> i+1
the next iteration
fast pointer --> i+3
slow --> i+2
whereas such a case will never happen with an increment of 2.
Also, if you are really unlucky, you may end up in a situation where the loop length is L and you are incrementing the fast pointer by L + 1. Then you will be stuck infinitely, since on every iteration the fast pointer gains exactly L on the slow pointer, so the gap between them inside the loop never changes.
I hope I made myself clear.

Implementing Deque using 3 Stacks (Amortized time O(1))

I have this question for homework:
Implement a deque using 3 stacks. The deque has these operations: InsertHead, InsertTail, DeleteHead, DeleteTail. Prove that the amortized time for each operation is O(1).
What I've tried is to look at the problem as a Tower of Hanoi problem.
So let's call the stacks L (left), M (middle), R (right).
Pseudo-code Implementations:
InsertHead(e):
    L.push(e)

DeleteHead():
    if L is empty:
        while R is not empty:
            pop from R and push onto M
        pop M            // this is the head element
        while M is not empty:
            pop from M and push onto R
    else:
        return L.pop()
InsertTail and DeleteTail follow the same principle as the implementations above.
How can I prove that the amortized time is O(1)?
Because there can be N elements in a stack, a single DeleteHead's while loops can take O(N); now, if I call DeleteHead N times, won't the amortized calculation give O(N^2) in total rather than O(1) per operation?
Can someone help me prove that the above implementations take O(1) amortized time?
We proceed using the potential method; define
Phi = C |L.size - R.size|
For some constant C for which we will pick a value later. Let Phi_t denote the potential after t operations. Note that in a "balanced" state where both stacks have equal size, the data structure has potential 0.
The potential at any time is a constant times the difference in the number of elements in each stack. Note that Phi_0 = 0 so the potential is zero when the structure is initialised.
It is clear that a push operation increases the potential by at most C. A pop operation which does not miss (i.e. where the relevant stack is nonempty) also alters the potential by at most C. Both of these operations have true cost 1, hence they have amortised cost 1 + C.
When a pop operation occurs and causes a miss (when the stack we want to pop from is empty), the true cost of the operation is 1 + 3/2 * R.size for when we are trying to pop from L, and vice versa for when we are popping from R. This is because we move half of R's elements to M and back, and the other half of R's elements to L. The +1 is needed because of the final pop from L after this rebalancing operation is done.
Hence, if we take C := 3/2, then a pop operation on which a miss occurs has amortised cost O(1): the potential drops from 3/2 * R.size to about 0 due to the rebalancing, which pays for the moves, and then we may incur an additional cost of at most 3/2 from the pop which occurs after the rebalance.
In other words, each operation has an amortised cost bounded by a constant.
Finally, because the initial potential is 0, and the potential is always nonnegative, each operation is amortised cost O(1), as required.
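For completeness, here is a Python sketch of a three-stack deque that performs the rebalancing this analysis assumes: on a miss, only half of the non-empty stack is moved across (parking the other half in M), rather than reversing the whole stack as in the question's pseudocode. Names and details are illustrative:

class Deque3Stacks:
    """Deque on three stacks (Python lists used as stacks).
    L holds the head side (head on top), R holds the tail side (tail on top),
    and M is used only as scratch space during rebalancing."""

    def __init__(self):
        self.L, self.M, self.R = [], [], []

    def insert_head(self, e):
        self.L.append(e)

    def insert_tail(self, e):
        self.R.append(e)

    def delete_head(self):
        if not self.L:
            self._rebalance(self.R, self.L)   # miss: refill L from R
        return self.L.pop()

    def delete_tail(self):
        if not self.R:
            self._rebalance(self.L, self.R)   # miss: refill R from L
        return self.R.pop()

    def _rebalance(self, src, dst):
        # Park the top half of src in M, move the bottom half to dst,
        # then restore the parked half; the order of the deque is preserved.
        if not src:
            raise IndexError("deque is empty")
        half = len(src) // 2
        for _ in range(half):
            self.M.append(src.pop())
        while src:
            dst.append(src.pop())
        while self.M:
            src.append(self.M.pop())

With Phi = (3/2) * |L.size - R.size| as above, each insert changes the potential by at most 3/2, and the roughly (3/2) * src.size moves done on a miss are paid for by the corresponding drop in potential, giving O(1) amortized cost per operation.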

Resources