1000 items, 1000 nodes, 3 items per node, best replication scheme to minimize data loss as nodes fail? [closed] - algorithm

I was wondering what would be the right answer for Question 2-44 in Skiena's Algorithm Design Manual (2nd ed.)
The question is the following:
We have 1,000 data items to store on 1,000 nodes. Each node can store
copies of exactly three different items. Propose a replication scheme
to minimize data loss as nodes fail. What is the expected number of
data entries that get lost when three random nodes fail?
I was thinking of node n holding data items n, n+1, and n+2.
So if 3 consecutive nodes are lost, then we lose 1 item.
Is there a better solution?

The approach you propose is not bad, but also take a look here. The ideas used in RAID may give you some ideas. For instance, if you have 2 data items, then with storage for 3 items you can recover either of them if it fails. The idea is quite simple: you store the two items on two nodes and the XOR of their bits on the third node. I believe if you utilize this idea you will be able to have more than 3 backups of a single data item (i.e. more than 3 nodes have to fail in order to lose the information).

I thought of methods like RAID levels, but Skiena says "each node can store copies of exactly three different items." Even though the XOR'ed bit pattern of two separate items can be stored in the same amount of space, I did not think that was what the problem was looking for.
So, I started with what the OP thought of: store the three copies of each data item on its own node and its next two neighbors, in a striped fashion. For example, the following is for N == 6, where the data items are the integers from 0 to 5 (items 4 and 5 wrap around and use nodes 0 and 1):
nodes: 0 1 2 3 4 5
===========
copy 0 -> 0 1 2 3 4 5
copy 1 -> 5 0 1 2 3 4
copy 2 -> 4 5 0 1 2 3
Of all the 20 combinations of three-node failures, there are six that lose exactly one piece of data. For example, when nodes 1, 2, and 3 fail, data item 1 is lost:
===========
0 X X X 4 5
5 X X X 3 4
4 X X X 2 3
It is similar for each of the other data items, so 6 of the 20 combinations lose data. Skiena does not describe what "data loss" means for the application: does the loss of a single data item mean that the entire collection is wasted, or is losing a single item acceptable and better than losing two?
If the loss of even a single data point means that the entire collection is wasted, then we can do better. Three times better! :)
Instead of distributing the copies of data to the right-hand nodes in a striped fashion, define groups of three nodes that share their data. For example, let nodes 0, 1, and 2 share their data, and let nodes 3, 4, and 5 share theirs:
nodes: 0 1 2 3 4 5
===========
copy 0 -> 0 1 2 3 4 5
copy 1 -> 2 0 1 5 3 4
copy 2 -> 1 2 0 4 5 3
This time, only 2 of the 20 combinations ever produce data loss. Data items 0, 1, and 2 are lost together when nodes 0, 1, and 2 fail:
===========
x x x 3 4 5
x x x 5 3 4
x x x 4 5 3
And data items 3, 4, and 5 are lost together when nodes 3, 4, and 5 fail:
===========
0 1 2 x x x
2 0 1 x x x
1 2 0 x x x
That amounts to just 2 of the 20 combinations of three-node failures. When the same nodes share the same data, the data losses are effectively merged into a smaller number of combinations.
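To double-check these counts, a small brute-force program can enumerate every three-node failure under both layouts. This is my own sketch, not part of the original answer; N is kept small and divisible by 3 so the grouped layout applies. Note that the expected number of lost items comes out the same under both layouts, because every item lives on exactly 3 nodes either way; what the grouping improves is the probability that any loss occurs at all. For 1,000 nodes and three uniformly random failures that expectation is 1000 / C(1000,3), roughly 6 * 10^-6 items.
#include <cstdio>
#include <vector>
using namespace std;

int main() {
    const int N = 6;  // small enough to enumerate all C(N,3) failures; divisible by 3 for the grouped layout
    // holders[scheme][item] = the three nodes storing that item
    vector<vector<vector<int>>> holders(2, vector<vector<int>>(N));
    for (int i = 0; i < N; ++i) {
        holders[0][i] = { i, (i + 1) % N, (i + 2) % N };   // striped: item i on nodes i, i+1, i+2
        int g = i / 3 * 3;
        holders[1][i] = { g, g + 1, g + 2 };               // grouped: item i on the nodes of its triple
    }
    for (int s = 0; s < 2; ++s) {
        long long badCombos = 0, lostItems = 0, combos = 0;
        for (int a = 0; a < N; ++a)
            for (int b = a + 1; b < N; ++b)
                for (int c = b + 1; c < N; ++c) {
                    ++combos;
                    int lost = 0;
                    for (int i = 0; i < N; ++i) {
                        int dead = 0;
                        for (int h : holders[s][i])
                            if (h == a || h == b || h == c) ++dead;
                        if (dead == 3) ++lost;             // all three copies of item i are gone
                    }
                    if (lost > 0) ++badCombos;
                    lostItems += lost;
                }
        printf("%s: %lld of %lld failure sets lose data, expected loss %.4f items\n",
               s == 0 ? "striped" : "grouped", badCombos, combos, (double)lostItems / combos);
    }
}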
Ali

Let,
D = {1,...,d_i,...,d} denote the data items and d_i a given data element
N = {1,...,n_k,...,n} denote the storage cluster and n_k a given storage node.
We say d_i is stored by n_k, loosely denoted by d_i \in n_k.
My replication model has the following assumptions:
1- Every data item must be stored at least in one given node during initialization. I.e.:
There exists at least one k, 1 <= k <= n, such that P(d_i \in n_k) = 1.
2- From (1), at initialization time, the probability of d_i being in a given node is at least 1/n. I.e.:
For any data item 1 <= i <= d and a random node n_k, P(d_i \in n_k) >= 1/n.
Given the problem statement, by design, we want to have this distribution uniform across the data set.
3- Lastly, by design, the probability of a data item d_i being in a given node should be independent between data items. I.e.:
P(d_i \in n | d_j \in n) = P(d_i \in n)
This is because we don't assume the probability of node failure is independent between adjacent nodes (e.g. in data centers, adjacent nodes may share the same network switch, etc.).
From these assumptions, I propose the following replication model (for the problem instance where d = n and each node stores exactly 3 distinct data items).
(1) Perform a random permutation of data set.
(2) Using a sliding window of length 3 and stride 1, rotate over the shuffled data set and map the data items to each node.
E.g.:
D = {A,B,C,D}
N = {1,2,3,4}
(1) {C, B, A, D}
(2) 1 -> {C, B, A}, 2 -> {B, A, D}, 3-> {A, D, C}, 4-> {D, C, B}
The random shuffling ensures independence (3) and a uniform distribution (2), while the sliding window of stride 1 guarantees (1).
Let's denote the sliding window of a given node n_k as the ordered set w_k = {w_k1, w_k2, w_k3}. n_k is said to be the master node for w_k1 (the first element of w_k). Any other node n_j containing w_k1 is a replica node. N.B.: the proposed replication model guarantees exactly one master node for any d_i, while the number of replica nodes depends on the window length.
In the example above, n_1 is the master node for C, and n_3 and n_4 are replica nodes.
Back to the original problem: given this scheme, we can state that a data item is lost only when its master node and all of its replica nodes are lost.
P(d_i is lost) = P(master node for d_i fails and replica 1 fails and replica 2 fails).
Without a formal proof, an unbiased random permutation in step (1) above would give
P(d_i is lost) = P(master node for d_i fails) * P(replica 1 fails) * P(replica 2 fails).
Again, the random permutation is a heuristic to abstract away the joint distribution of node failures.
From assumptions (2) and (3), P(d_i is lost) = c, for any d_i, at initialization time.
That said, for d = n = 1000 and a replication factor of 3 (i.e. a window length of 3):
P(d_i is lost) = 1/1000 * 1/999 * 1/998 ~ 10^-9
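A minimal sketch (mine) of the assignment procedure described above: a random permutation of the data set, then a sliding window of length 3 and stride 1 that wraps around at the end.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>
using namespace std;

// Node k receives the window {perm[k], perm[k+1], perm[k+2]} (indices mod n).
// perm[k] is the item for which node k is the master; the two following nodes
// holding that item are its replicas.
vector<vector<int>> assignReplicas(int n, mt19937 &rng) {
    vector<int> perm(n);
    for (int i = 0; i < n; ++i) perm[i] = i;
    shuffle(perm.begin(), perm.end(), rng);                           // step (1): random permutation
    vector<vector<int>> node(n);
    for (int k = 0; k < n; ++k)                                       // step (2): sliding window, stride 1
        node[k] = { perm[k], perm[(k + 1) % n], perm[(k + 2) % n] };
    return node;
}

int main() {
    mt19937 rng(42);
    auto node = assignReplicas(8, rng);
    for (int k = 0; k < 8; ++k)
        printf("node %d -> {%d, %d, %d}\n", k, node[k][0], node[k][1], node[k][2]);
}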

Your approach seems essentially correct but can benefit from a failover strategy. Notice that Prof. Skiena has asked "to minimize data loss as nodes fail" which suggests that failing nodes will be a common occurrence.
You may want to have a look at consistent hashing.
Also, there is a great post by reddit engineers about the perils of not using consistent hashing (instead using a fixed MOD hashing).
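For reference, a minimal consistent-hash ring sketch in C++ (mine, not taken from the linked post): node names are hashed onto a ring, a key is served by the first node clockwise from the key's hash, and removing a node only remaps the keys that fell on that node's arc.
#include <cstdio>
#include <functional>
#include <map>
#include <string>
using namespace std;

// Minimal consistent-hash ring: ring position -> node name. std::hash is used
// only for illustration; a real deployment would use a stable hash and several
// virtual nodes per server, and would handle hash collisions.
struct Ring {
    map<size_t, string> ring;
    void addNode(const string &name)    { ring[hash<string>{}(name)] = name; }
    void removeNode(const string &name) { ring.erase(hash<string>{}(name)); }
    string nodeFor(const string &key) const {                 // assumes the ring is non-empty
        size_t h = hash<string>{}(key);
        auto it = ring.lower_bound(h);                        // first node clockwise from h
        if (it == ring.end()) it = ring.begin();              // wrap around the ring
        return it->second;
    }
};

int main() {
    Ring r;
    r.addNode("node-a"); r.addNode("node-b"); r.addNode("node-c");
    printf("item-42 -> %s\n", r.nodeFor("item-42").c_str());
    r.removeNode("node-b");                                   // only node-b's arc is remapped
    printf("item-42 -> %s\n", r.nodeFor("item-42").c_str());
}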

Related

Data structure to handle numerous queries on large size array

You are given q queries of the following form, to be performed on a list (initially empty):
1 x y: Add number x to the list y times.
2 n: find the nth number of the sorted list
constraints
1 <= q <= 5 * 100000
1 <= x, y <= 1000000000
1 <= n < length of list
Sample input:
4
1 3 6
1 5 2
2 7
2 4
Sample output:
5
3
This is a competitive programming problem, and it's too early in the morning for me to solve it right now, but I can try to give some pointers.
If you were to store the entire array explicitly, it would obviously blow out your memory. But you can exploit the structure of the array to instead store the number of times each entry appears in the array. So if you got the query
1 3 5
then instead of storing [3, 3, 3, 3, 3], you'd store the pair (3, 5), indicating that the number 3 is in the list 5 times.
You can pretty easily build this, perhaps as a vector of pairs of ints that you update.
The remaining task is to implement the type-2 query, where you find an element by its index in the sorted list. A side effect of the structure we've chosen is that you can't directly index into that vector of pairs (kept sorted by value), since its indices don't match up with the indices into the hypothetical full array. We could add up the sizes of the entries from the start until we reach the index we want, but that is linear per query and O(n^2) over all the queries processed so far... likely too slow. Instead, we probably want some updatable data structure for prefix sums, perhaps as described in this answer.
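Here is one way that idea can be sketched in C++, under the assumption that all q queries may be read up front (offline): compress the distinct x values, keep a Fenwick (binary indexed) tree of counts over the compressed values, and answer a type-2 query by walking down the tree to the first position whose prefix sum reaches n. Variable names are mine.
#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

int main() {
    int q;
    if (scanf("%d", &q) != 1) return 0;
    vector<int> type(q), x(q);
    vector<long long> y(q);
    vector<long long> vals;                       // distinct x values, for coordinate compression
    for (int i = 0; i < q; ++i) {
        scanf("%d", &type[i]);
        if (type[i] == 1) { scanf("%d %lld", &x[i], &y[i]); vals.push_back(x[i]); }
        else               scanf("%lld", &y[i]); // for type 2, y[i] holds n
    }
    sort(vals.begin(), vals.end());
    vals.erase(unique(vals.begin(), vals.end()), vals.end());
    int m = vals.size();
    vector<long long> fen(m + 1, 0);              // Fenwick tree of counts per compressed value

    auto add = [&](int i, long long d) { for (; i <= m; i += i & -i) fen[i] += d; };
    auto kth = [&](long long k) {                 // k-th smallest element, O(log m)
        int pos = 0, LOG = 1;
        while ((1 << LOG) <= m) ++LOG;
        for (int pw = 1 << LOG; pw; pw >>= 1)
            if (pos + pw <= m && fen[pos + pw] < k) { pos += pw; k -= fen[pos]; }
        return vals[pos];                         // pos = number of values with prefix sum < k
    };

    for (int i = 0; i < q; ++i) {
        if (type[i] == 1) {
            int c = lower_bound(vals.begin(), vals.end(), (long long)x[i]) - vals.begin();
            add(c + 1, y[i]);                     // value x[i] appears y[i] more times
        } else {
            printf("%lld\n", kth(y[i]));
        }
    }
}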

Algorithm - How to pair blocks efficiently

Let's say we have two 3X3 blocks with a height in each cell. Each cell value represents the height of that cell. For example, in block-1 below, cell (1,1) has a height of 1, cell (1,2) has a height of 2, and so on.
block-1,
1 2 3
1 3 2
3 1 2
block-2,
4 3 2
4 2 3
2 4 3
Given two such blocks, how can we efficiently check whether the two blocks can be connected in such a way that no cell is mismatched and both blocks together produce a cuboid?
For example, block-1 and block-2 above can be connected, and the resulting block will be a perfect cuboid of height 5. The resulting cuboid will be:
5 5 5
5 5 5
5 5 5
Extension of the problem: Given a set (size >= 50K) of such 4X4 blocks, how do we connect pairs of blocks so as to maximise the total height of the resulting cuboids? Only matched blocks contribute their full cuboid height to the total; non-matched blocks are ignored. Each cell height can be up to 20 units.
Further extension of the problem: blocks may also be rotated to make a pair with another block, so as to maximise the total height of the resulting cuboids.
Any clue?
You could solve the problem in two steps: (1) find all pairs of blocks that connect (build a cuboid) and (2) find the best pairing, i.e. the one that maximizes the total height.
Find connecting pairs
For this I would (a) build a surface representation for each block, (b) hash the blocks by their surface representation and (c) search for each block all connecting blocks by looking for the connecting surface models.
(a) Building the surface model
The basic idea is to represent each block by its surface. For this you just subtract the minimum entry in the matrix from every entry in the matrix
The surface representation of block-1 will be
1 2 3   -1    0 1 2
1 3 2   -->   0 2 1
3 1 2         2 0 1
and the surface representation of block-2 will be
4 3 2   -2    2 1 0
4 2 3   -->   2 0 1
2 4 3         0 2 1
(b) hash the blocks
Now you hash the blocks by their surface representation
(c) Finding connecting pairs
For each block you then compute the connecting surface model by taking the maximum value of its surface representation and subtracting each entry of the matrix from it. For block-1 this yields
    0 1 2   2 1 0
2 - 0 2 1 = 2 0 1
    2 0 1   0 2 1
The blocks with this surface representation can be found using the hash table (note that the surface representation of block-2 will match).
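As an illustration, here is a small C++ sketch (mine) of steps (a)-(c) applied to the two example blocks. Encoding a block as a comma-separated string so it can be used as a hash key is just one convenient choice; any canonical encoding works.
#include <algorithm>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

typedef vector<int> Block;                    // 9 heights, row-major 3x3

// (a) surface representation: subtract the minimum height from every cell
Block surface(const Block &b) {
    int mn = b[0];
    for (int v : b) mn = min(mn, v);
    Block s;
    for (int v : b) s.push_back(v - mn);
    return s;
}

// encode a block as a string key for hashing
string key(const Block &b) {
    string k;
    for (int v : b) { k += to_string(v); k += ','; }
    return k;
}

// (c) connecting surface model: subtract each entry from the surface's maximum
Block connecting(const Block &s) {
    int mx = s[0];
    for (int v : s) mx = max(mx, v);
    Block c;
    for (int v : s) c.push_back(mx - v);
    return c;
}

int main() {
    Block b1 = {1,2,3, 1,3,2, 3,1,2};
    Block b2 = {4,3,2, 4,2,3, 2,4,3};
    unordered_map<string, int> bySurface;     // (b) hash the blocks by their surface key
    bySurface[key(surface(b1))] = 1;
    bySurface[key(surface(b2))] = 2;
    string want = key(connecting(surface(b1)));
    if (bySurface.count(want))
        printf("block-1 connects with block-%d\n", bySurface[want]);
}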
Note: when you allow for rotation, you will have to perform 4 queries on the hash table, one for each possible rotation.
Finding the best pairing
To find the best pairing (maximizing the sum of the heights of the connected blocks) I would use the Hungarian Algorithm. In order to do this you will have to build a matrix where the entry (i, j) contains the height of the resulting block when blocks i and j connect, and 0 otherwise.
Edit
I think the second step (finding best pairing) can be done more efficiently, by connecting pairs of matching blocks greedily (connecting pairs resulting in highest blocks first).
The intuition for this is: when two blocks a and b have the same surface model, they will either both connect to another block c or neither of them will. With this in mind, after the "find connecting pairs" step you will end up with pairs of groups of blocks (Xi, Yi) where each block of Xi connects to each block of Yi. If the two groups Xi and Yi are of equal size, then we can connect them in any way we want and will always get the same sum of heights of the resulting cuboids. If one of the groups (wlog Yi) contains fewer elements, then we want to avoid connecting to the smallest blocks of Xi. Thus we can greedily connect, starting with the largest blocks, and in doing so avoid connecting to the smallest blocks.
So the algorithm could work as follows:
(1) Hash each block according to its surface representation. Sort the blocks with the same surface representation in descending order of their offset (the height of the block minus its surface representation, i.e. the minimum that was subtracted).
(2) Process the blocks in order of descending offset. For each block, search for a connecting block cBlock with the highest offset, connect the two blocks, and remove cBlock from the hash table and from the processing pipeline.
Overall this should be doable in O(n log n)
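Here is a sketch (mine, not code from the answer) of that greedy pairing step. It assumes the blocks have already been grouped by a canonical surface key (for example the string encoding in the sketch above) along with each block's offset (the minimum that was subtracted) and the peak (maximum) of its surface, and that complementKeyOf maps each surface key to the key of the complementary surface; the height of a cuboid built from two matching blocks is then offsetA + offsetB + peak.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>
using namespace std;

struct Entry { int offset; int peak; };   // offset = min height subtracted, peak = max of surface

// Pair each group with its complementary group, largest offsets first, and
// return the total height of all cuboids that get built.
long long pairGreedily(map<string, vector<Entry>> &groups,
                       const map<string, string> &complementKeyOf) {
    long long total = 0;
    for (auto &g : groups) {
        const string &k = g.first;
        auto ckIt = complementKeyOf.find(k);
        if (ckIt == complementKeyOf.end()) continue;
        const string &ck = ckIt->second;
        if (ck < k) continue;                          // handle each (k, ck) pair only once
        auto bIt = groups.find(ck);
        if (bIt == groups.end()) continue;             // no block has the matching surface
        auto &a = g.second, &b = bIt->second;
        auto byOffsetDesc = [](const Entry &x, const Entry &y) { return x.offset > y.offset; };
        sort(a.begin(), a.end(), byOffsetDesc);
        sort(b.begin(), b.end(), byOffsetDesc);
        if (k == ck) {                                 // self-complementary surfaces pair among themselves
            for (size_t i = 0; i + 1 < a.size(); i += 2)
                total += a[i].offset + a[i + 1].offset + a[i].peak;
        } else {
            for (size_t i = 0; i < a.size() && i < b.size(); ++i)
                total += a[i].offset + b[i].offset + a[i].peak;
        }
    }
    return total;
}

int main() {
    // hypothetical precomputed data: two blocks with surface key "S" (offsets 3 and 1)
    // and two blocks with the complementary key "C" (offsets 5 and 2); shared peak 2
    map<string, vector<Entry>> groups = { {"S", {{3, 2}, {1, 2}}}, {"C", {{5, 2}, {2, 2}}} };
    map<string, string> complementKeyOf = { {"S", "C"}, {"C", "S"} };
    printf("total height = %lld\n", pairGreedily(groups, complementKeyOf));   // (3+5+2) + (1+2+2) = 15
}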

How to perform range updates in sqrt{n} time?

I have an array and I have to perform queries and updates on it.
For a query, I have to find the frequency of a particular number in a range from l to r, and for an update, I have to add x to every element in some range from l to r.
How can I do this?
I thought of sqrt{n} decomposition, but I don't know how to perform range updates within this time complexity.
Edit - Since some people are asking for an example, here is one
Suppose the array is of size n = 8
and it is
1 3 3 4 5 1 2 3
And here are 3 queries to help explain what I am trying to say:
q 1 5 3 - This means that you have to find the frequency of 3 in the range 1 to 5, which is 2, as 3 appears at the 2nd and 3rd positions.
The second is an update query, and it goes like this - u 2 4 6 -> This means that you have to add 6 to the array elements in the range 2 to 4. So the new array will become
1 9 9 10 5 1 2 3
And the last query is again the same as the first one, which will now return 0, as there is no 3 in the array between positions 1 and 5 any more.
I believe things must be more clear now. :)
I developed this algorithm a long time (20+ years) ago, for an arithmetic coder.
Both update and retrieval are performed in O(log(N)).
I named this algorithm the "Method of Intervals". Let me show you an example.
Imagine we have 8 intervals, numbered 0-7:
+--0--+--1--+--2-+--3--+--4--+--5--+--6--+--7--+
Let's create an additional set of intervals, each spanning a pair of the original ones:
+----01-----+----23----+----45-----+----67-----+
Then we create one more layer of intervals, each spanning a pair from the second layer:
+---------0123---------+---------4567----------+
And at last, we create a single interval covering all 8:
+------------------01234567--------------------+
As you see, in this structure, to retrieve the right border of interval [5] you just add together the lengths of intervals [0123] and [45]. To retrieve the left border of interval [5], you need the sum of the lengths of intervals [0123] and [4] (the left border of 5 is the right border of 4).
Of course, the left border of interval [0] is always 0.
When you look at this proposed structure carefully, you will see that the odd elements in each layer aren't needed. That is, you do not need the elements 1, 3, 5, 7, 23, 67 and 4567, since these elements are never used during retrieval or update.
Let's remove the odd elements and apply the following renumbering:
+--1--+--x--+--3-+--x--+--5--+--x--+--7--+--x--+
+-----2-----+-----x----+-----6-----+-----x-----+
+-----------4----------+-----------x-----------+
+----------------------8-----------------------+
As you see, with this renumbering only the numbers [1-8] are used. Let them be array indexes; as you see, the memory used is O(N).
To retrieve the right border of the 7th interval, you just add the lengths stored at indexes 4, 6 and 7. To update the length of an interval, you likewise add the difference to the O(log N) entries whose range covers that interval. As a result, both retrieval and update are performed in O(log N) time.
Now we need an algorithm to compute, from the interval number, the set of indexes in this data structure. For instance, how to convert:
1 -> 1
2 -> 2
3 -> 3,2
...
7 -> 7,6,4
This is easy if we look at the binary representations of these numbers:
1 -> 1
10 -> 10
11 -> 11,10
111 -> 111,110,100
As you see, in each chain the next value is the previous value with the rightmost "1" changed to "0". Using the simple bit operation x & (x - 1), we can write a simple loop to iterate over the array indexes related to an interval number:
int interval = 7;
do {
    int index = interval;       // visits indexes 7, 6, 4 for interval number 7
    do_something(index);        // e.g. accumulate the length stored at this index
} while (interval &= interval - 1);
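Putting it together, here is a compact C++ version of this structure (nowadays usually called a Fenwick tree, or binary indexed tree). The right-border query is exactly the x & (x - 1) loop above; a point update walks in the opposite direction with i += i & -i, so that every entry whose range covers the interval is adjusted. Names are mine.
#include <cstdio>
#include <vector>
using namespace std;

// Fenwick / binary indexed tree over n interval lengths, 1-based indexes.
struct Fenwick {
    vector<long long> t;
    Fenwick(int n) : t(n + 1, 0) {}
    // add d to the length of the i-th interval (touches O(log n) entries)
    void update(int i, long long d) {
        for (; i < (int)t.size(); i += i & -i) t[i] += d;
    }
    // right border of the i-th interval = sum of the first i lengths,
    // following the i -> i & (i - 1) chain described above
    long long rightBorder(int i) const {
        long long s = 0;
        for (; i > 0; i &= i - 1) s += t[i];
        return s;
    }
};

int main() {
    Fenwick f(8);
    for (int i = 1; i <= 8; ++i) f.update(i, 1);   // all 8 intervals start with length 1
    printf("%lld\n", f.rightBorder(7));            // adds entries 7, 6, 4 -> prints 7
    f.update(7, 2);                                // grow the 7th interval by 2
    printf("%lld\n", f.rightBorder(7));            // now prints 9
}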

Determining the Longest Contiguous Subsequence

There are N nodes (1 <= N <= 100,000) at various positions along a long one-dimensional line. The ith node is at position x_i (an integer in the range 0...1,000,000,000) and has a node type b_i (an integer in the range 1..8). No two nodes are at the same position.
You want to find a range on this line in which all of the node types that are present are fairly represented. That is, for whatever types of nodes are present in the range, there must be an equal number of each of those types (for example, a range with 27 each of types 1 and 3 is ok, a range with 27 each of types 1, 3, and 4 is ok, but 9 of type 1 and 10 of type 3 is not ok). You also want at least K (K >= 2) of the 8 types to be represented in the range.
Find the maximum size of a range that satisfies these constraints. The size of a range is the difference between the maximum and minimum positions of the nodes in it.
If there are no ranges satisfying the constraints, output -1 instead.
INPUT:
* Line 1: N and K separated by a space
* Lines 2..N+1: Each line contains a description of a node as two
integers separated by a space; x(i) and its node type.
SAMPLE INPUT:
9 2
1 1
5 1
6 1
9 1
100 1
2 2
7 2
3 3
8 3
INPUT DETAILS:
Node types:  1  2  3  -  1  1  2  3  1  -  ...  -   1
Locations:   1  2  3  4  5  6  7  8  9  10 ...  99  100
OUTPUT:
* Line 1: A single integer indicating the maximum size of a fair
range. If no such range exists, output -1.
SAMPLE OUTPUT:
6
OUTPUT DETAILS:
The range from x = 2 to x = 8 has 2 each of types 1, 2, and 3. The range
from x = 9 to x = 100 has 2 of type 1, but this is invalid because K = 2
and so you need at least 2 distinct types of nodes.
Could you please help by suggesting some algorithm to solve this? I have thought about using some sort of priority queue or stack data structure, but am really unsure how to proceed.
Thanks, Todd
It's not too difficult to invent an almost linear-time algorithm, because a similar problem was recently discussed on CodeChef: "ABC-Strings".
Sort nodes by their positions.
Prepare all possible subsets of node types (for example, we could expect types 1, 2, 4, 5, 7 to be present in the resulting interval and all other types to be absent). For K = 2 there are only 256 - 8 - 1 = 247 such subsets. For each subset perform the remaining steps:
Initialize 8 type counters to [0,0,0,0,0,0,0,0].
For each node perform remaining steps:
Increment counter for current node type.
Take the L counters for the types included in the current subset and subtract the first of them from the other L-1 counters, which produces L-1 values. Take the remaining 8-L counters and combine them together with those L-1 values into a tuple of 7 values.
Use this tuple as a key for a hash map. If the hash map contains no value for this key, add a new entry with this key and a value equal to the position of the next node (a matching range can start no earlier than the node right after the current one). Otherwise, subtract the value stored in the hash map from the position of the current node and (possibly) update the best result.
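Here is a direct C++ sketch of this recipe; the 7-value tuple is encoded as a string so it can serve as an unordered_map key, and the map is additionally seeded with the empty-prefix key so that ranges starting at the very first node are also found. Variable names follow the question's input format.
#include <algorithm>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

int main() {
    int n, K;
    scanf("%d %d", &n, &K);
    vector<pair<long long, int>> a(n);            // (position, type 1..8)
    for (auto &p : a) scanf("%lld %d", &p.first, &p.second);
    sort(a.begin(), a.end());                     // step 1: sort nodes by position

    long long best = -1;
    for (int mask = 0; mask < 256; ++mask) {      // step 2: every subset of the 8 types
        if (__builtin_popcount(mask) < K) continue;
        int first = __builtin_ctz(mask);          // lowest included type (0-based)
        vector<long long> cnt(8, 0);              // step 3: per-type counters
        unordered_map<string, long long> seen;    // key -> leftmost position a matching range can start at

        auto makeKey = [&]() {                    // the 7-value tuple, encoded as a string
            string key;
            for (int t = 0; t < 8; ++t) {
                if (t == first) continue;
                long long v = (mask >> t & 1) ? cnt[t] - cnt[first] : cnt[t];
                key += to_string(v); key += ',';
            }
            return key;
        };

        seen[makeKey()] = a[0].first;             // empty prefix: a range may start at the first node
        for (int j = 0; j < n; ++j) {             // step 4: sweep the nodes
            ++cnt[a[j].second - 1];
            string key = makeKey();
            auto it = seen.find(key);
            if (it != seen.end())
                best = max(best, a[j].first - it->second);
            else if (j + 1 < n)
                seen[key] = a[j + 1].first;       // matching ranges start at the node after j
        }
    }
    printf("%lld\n", best);
}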

Deciphering the key

Alice invents a key (s1, s2, s3, ..., sk). Bob makes a guess (g1, g2, g3, ..., gk). He is awarded one point for each position where si = gi.
Each si is an integer in the range 0 <= si <= 11.
Given q guesses with their scores bi:
(g1, g2, g3, ... , gk) b1
(g1, g2, g3, ... , gk) b2
.
.
.
(g1, g2, g3, ... , gk) bq
Can you state whether a key is possible? Given 0 <= si <= 11, 1 <= k <= 11, 1 <= q <= 8.
For Example
2 2 1 1 2
1 1 2 2 1
For the guess 2 2 1 1 the score is 2
For the guess 1 1 2 2 the score is 1
There is a possible key, for example 2 1 1 3, which gives the desired scores. Hence the answer is YES.
Another Example
1 2 3 4 4
4 3 2 1 1
For the guess 1 2 3 4 the score is 4
For the guess 4 3 2 1 the score is 1
There is no key which gives the desired scores, hence the answer is NO.
I tried the brute force approach, generating all n^k keys where n is the range of si, but it gave a Time Limit Exceeded error.
It's an interview puzzle. I have seen variants of this question but was not able to solve them. Can you tell me what I should read for this type of question?
I don't know the best solution to this problem, but if you did a recursive search of the possible solution space, pruning branches which could not possibly lead to a solution, it would be much faster than trying all (n^k) keys.
Take your example:
1 2 3 4 4 -> 4
4 3 2 1 1 -> 1
The 3 possible values for the first key digit which could be significant are: 1, 4, and "neither 1 nor 4". Choose one of them, and then recursively look at the possible values for the second key digit. Choose one, recursively look at the possible values for the third key digit, and so on.
As you search, keep track of a cumulative score for each of the guesses from b1 to bq. Whenever you choose a value for a digit, increment the cumulative scores for all the guesses which have the same number in that position. Keep these cumulative scores on a stack (so you can back up).
When you reach a point where no solution is possible, back up and continue searching a different path. If you back all the way up to g1 and no more paths are left to search, then the answer is NO. If you find a solution, then the answer is YES.
When to stop searching a path and back up:
If the cumulative score of one of the guesses exceeds the given score
If the cumulative score of one of the guesses is less than the given score minus the number of levels left in the search tree (before you hit the bottom)
This approach could still be very slow, especially if "k" was large. But again, it will be far faster than generating (n^k) keys.
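Here is a sketch of this pruned search in C++. The input format is my assumption (k and q on the first line, then each guess followed by its score), since the question does not pin one down; the candidate digits per position are the distinct guess digits there plus one digit used by no guess, and the two pruning rules are the ones listed above.
#include <cstdio>
#include <vector>
using namespace std;

int k, q;
vector<vector<int>> g;   // g[i][p] = digit p of guess i
vector<int> b;           // b[i]   = required score of guess i

// try key digits for position pos, given the cumulative scores so far
bool feasible(int pos, vector<int> &cum) {
    if (pos == k) {
        for (int i = 0; i < q; ++i)
            if (cum[i] != b[i]) return false;
        return true;
    }
    // prune: a guess is already over its score, or can no longer reach it
    for (int i = 0; i < q; ++i)
        if (cum[i] > b[i] || cum[i] + (k - pos) < b[i]) return false;
    // candidates: each distinct guess digit at this position, plus one digit
    // (0..11) used by no guess here ("neither of the guess digits")
    vector<int> cand;
    vector<bool> used(12, false);
    for (int i = 0; i < q; ++i)
        if (!used[g[i][pos]]) { used[g[i][pos]] = true; cand.push_back(g[i][pos]); }
    for (int v = 0; v < 12; ++v)
        if (!used[v]) { cand.push_back(v); break; }
    for (int v : cand) {
        for (int i = 0; i < q; ++i) if (g[i][pos] == v) ++cum[i];
        if (feasible(pos + 1, cum)) return true;
        for (int i = 0; i < q; ++i) if (g[i][pos] == v) --cum[i];   // back up
    }
    return false;
}

int main() {
    // assumed input format: k and q, then q lines of k digits followed by the score
    scanf("%d %d", &k, &q);
    g.assign(q, vector<int>(k));
    b.assign(q, 0);
    for (int i = 0; i < q; ++i) {
        for (int p = 0; p < k; ++p) scanf("%d", &g[i][p]);
        scanf("%d", &b[i]);
    }
    vector<int> cum(q, 0);
    puts(feasible(0, cum) ? "YES" : "NO");
}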
