How to make sense of Phase 2 in Paxos distributed consensus algorithm? - algorithm

I have pasted pseudocode for a paxos algorithm here:
What is a "view" in the Paxos consensus algorithm?
and was wondering if someone could point me in the right direction.
The algorithm says that each node has a "state" which contains a bunch of information the node should keep track of.
Suppose we have two nodes: Node #1 and Node #2. In the simplest case Node #2 joins Node #1 and they both play paxos. What exactly happens to the states of Node #1 and Node #2 after 2 joins 1? When does the "views" data structure change and what does it contain? If someone can explain to me this simple case of two nodes playing paxos, then I think I can figure out multiple node case.
My current understanding (which I'm pretty sure is not correct) is as follows:
Node #2 sends a message to join Node #1.
Node #1 receives message from Node #2 asking to join.
Node #1 assumes leadership and kicks into phase 1, computes my_num = max(0,0) + 1 = 1
Node #1 sends all nodes in views[0] (which is empty) prepare(1,1)
Node #1 sends initial contact node (Node #2) prepare(1,1)
Node #1 sends Node #1 (itself) prepare(1,1)
Node #2 receives prepare(1,1). It sets its num_h=1 and returns to leader PROMISE(0,{empty list})
Node #1 receives prepare(1,1) and sets its num_h=1 and returns itself PROMISE(0,{empty list}).
now we get to phase 2
This is where I am quite confused.
Node #1 is the leader and it receives two PROMISE(0,{empty list}) messages. According to the algorithm, if the leader gets a majority of promises in views[0] then it can set a value for "v" and send the ACCEPT message to all responders.
What I am confused about is currently the views[0] for the leader is empty so how can you compute the majority of an empty list?
Also, let's assume the leader has received a majority of promises and proceeds to set v = set of pingable nodes (including self). What exactly are pingable nodes? Is it just Node #1 and Node #2?
Would appreciate all / any help and will definitely award points to those that help.

The pseudocode is not specialized for a particular problem. In fact, the professor said we don't need to use pseudocode if we don't want to and said we can look at other Paxos papers (i.e. paxos made simple, paxos made live,etc..) for guidance on implementing this algorithm. Maybe you are right, I should probably look at Wikipedia to implement this algorithm. So the views[..] is simply a hash map of and a node can choose any value it wishes to. If I understand you correctly, you say the majority statement just checks if it received "enough" PROMISE messages. But the only way to know if we have enough is if the node keeps track of who its group members are. Which means that I need a different data structure for this.
Also, I cannot award points to you because you made a comment. If you post an answer then I can give you points.

Related

Why do we make clone for some nodes in Suffix Automata Algorithm?

I have been studying about Suffix Automata string matching algorithm for a few days. I watched these videos and reed documents but I really can't get why we need to make a new node (under special condition) and clone it. I know how it works now but I am eager to learn the reason behind it. What would be the problem if we keep previous nodes? for example in the picture below we have new node (red Circle) for 'b' character. Can some one explain it to me? Appreciate.
There's no difference for your test case.
Another test case abbcbb. Which node should string bb be belonging to?
So clone a node is necessary to guarantee that the node corresponding to each substring is unique.

What is the difference between linkedlist and queue?

I'm new to data structures and it seems like both data structures have more similarities.
In this answer it says that there is a difference in interface.
Please explain it.
A queue is any data structure that is "FIFO = First-In First-Out." It's a waiting-list. (In Britain, the term is used in ordinary conversation ... you "wait in a queue" not "wait in line.")
A stack is any data structure that is "LIFO = Last-In First-Out." It's a pushdown stack like the stack of dishes in a cafeteria.
Linked lists are a possible implementation of either such structure. It consists of nodes which contain pointers to adjacent nodes in the list.
However, there are many other implementations. "Trees" of various kinds can also be used to implement both queues and stacks. Ordinary arrays can do it, although of course arrays cannot "grow."
Ideally, these days, you simply use an appropriate "container class" in your favorite language and fuhgeddabout how it actually was implemented. "You know that it works, therefore you don't care how." Actual implementation of such things is probably an academic exercise.
List is just a list of things (items, objects whatever). For example a list of courses you are taking in your semester. A list of songs that you are listening. A list of answers of this question on this page. There is no order associate with a list. You can add an item to a list anywhere, you can take an item off the list from anywhere, it doesn't change the definition of a list. Its just a grouping of similar (or not so similar) items.
Now consider a list of people standing in front of ATM machine or a bank teller. This list has to observe a particular order. The first person in the line (list) is the one that will be served first (and will be the first to leave this list). A new person coming in will be standing as a last person in the queue and will be served after everyone in front of him has been served. The people in middle of the list are not supposed to jump the line. This is an example of a Queue. You can also guess what a priority Queue would be (think Airlines with silver and gold members on check-ins).
I hope this explains the difference.
A link list is a list of nodes. Each node contain an address field and that address field holding the address of its next node. The reason for this kind of structure is to traverse through the list from its first node till last node. This type of structure is called singly link list. A link list can also be doubly linked, in that structure a node will have two address field where one field will store the address of its previous node and one address will hold the address of its next node. Most important thing of a link list is that its first node address must be stored in an address variable so that we can traverse through the link list at any time.
But Queue can be a link list or an array of nodes. In a list a node can be insert at any place. But in queue a new node must be inserted at the beginning of the list. A queue is working on basis of FIFO i.e. first in first out basis. Hence when you use the pop command on a queue if it is a link list it must remove the last node of the list and return the value of that last node. Hence a queue can be a list as well but with a principle called FIFO based.
You will get more information online. Read properly and try to understand the difference.
I had the same question as you! This is what I found:
A Queue is essentially just more restrictive than a LinkedList. For example, in a LinkedList you can use the method .add(int index, Object obj), but if you try doing that with a Queue interface, you’ll get an error, since with a Queue you can only add elements at the tail end. Similarly, in a LinkedList you can use .remove(int index) as well as .remove(Object obj), but attempting to do this with a Queue will result in an error, since you can only remove an object from the head. So, in essence, the Queue just has less options when it comes to methods you can use on it. (There might be more to it than that, but that’s what was most pertinent to me.)
There are some similarities between the two. For example, they both have the .poll() method, and the result is the same: removes the head element from the Object.
Here are some links in which you can compare the methods of the two (scroll to the bottom of each page to see them all, and you’ll see immediately that LinkedList has a lot more):
https://www.geeksforgeeks.org/linked-list-in-java/ (LinkedList)
https://www.geeksforgeeks.org/queue-interface-java/ (Queue)
In Java (and probably other languages too), a LinkedList implements the Queue interface. So in essence, a LinkedList is a Queue; it has all features that a Queue does and more. Keep in mind, a Queue is not a LinkedList, as a LinkedList is built and expanded upon a Queue.
See this:

Find Top 10 Most Frequent visited URl, data is stored across network

Source: Google Interview Question
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
Have many large <string (url) -> int (visits)> maps.
Calculate < string (url) -> int (sum of visits among all distributed maps), and get the top ten in the combined map.
Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.
I have now come across quite a few questions of this type, where processiong needs to be done over large Distributed systems. I cant think or find a suitable answer.
All I could think of is brute force, which in some or other way, violates the given constraint.
It says you can't use map-reduce directly which is a hint the author of the question wants you to think how map reduce works, so we will just mimic the actions of map-reduce:
pre-processing: let R be the number of servers in cluster, give each
server unique id from 0,1,2,...,R-1
(map) For each (string,id) - send the tuple to the server which has the id hash(string) % R.
(reduce) Once step 2 is done (simple control communication), produce the (string,count) of the top 10 strings per server. Note that the tuples where those sent in step2 to this particular server.
(map) Each server will send all his top 10 to 1 server (let it be server 0). It should be fine, there are only 10*R of those records.
(reduce) Server 0 will yield the top 10 across the network.
Notes:
The problem with the algorithm, like most big-data algorithms that
don't use frameworks is handling failing servers. MapReduce takes
care of it for you.
The above algorithm can be translated to a 2 phases map-reduce algorithm pretty straight forward.
In the worst case any algorithm, which does not require transmitting the whole frequency table, is going to fail. We can create a trivial case where the global top-10s are all at the bottom of every individual machines list.
If we assume that the frequency of URIs follow Zipf's law, we can come up with effecive solutions. One such solution follows.
Each machine sends top-K elements. K depends solely on the bandwidth available. One master machine aggregates the frequencies and finds the 10th maximum frequency value "V10" (note that this is a lower limit. Since the global top-10 may not be in top-K of every machine, the sum is incomplete).
In the next step every machine sends a list of URIs whose frequency is V10/M (where M is the number of machines). The union of all such is sent back to every machine. Each machines, in turn, sends back the frequency for this particular list. A master aggregates this list into top-10 list.

What are the performance benefits of keeping head node without data in linkedlist?

What are the performance benefit of keeping head node with useless data in linkedlist?
I read some implementation of linkedlist operations using head node which keeps only pointer to the first node( data in head node is useless).
But i am unable to figure out even a single benefit of using head node instead of head pointer.
Can someone please clarify with 1 problem and 2 implementations, one with head node and another with head pointer and tradeoff between performance/complexity?
Dummy header nodes are used for these reasons:
To satisfy the requirement that every node has a previous node (makes the methods simpler/uniform)
Not to deal with the special cases of inserting into and deleting from the head. It is just like any other node
Other than that, no, they do not provide any performance/memory advantage.
Among other things, one more advantage is that, you can keep the number of currently available nodes in the linked list, as nodes are added and removed frequently, so you keen keep the record of the number of nodes in the head node of the linked list.

Need some help understanding this problem about maximizing graph connectivity

I was wondering if someone could help me understand this problem. I prepared a small diagram because it is much easier to explain it visually.
alt text http://img179.imageshack.us/img179/4315/pon.jpg
Problem I am trying to solve:
1. Constructing the dependency graph
Given the connectivity of the graph and a metric that determines how well a node depends on the other, order the dependencies. For instance, I could put in a few rules saying that
node 3 depends on node 4
node 2 depends on node 3
node 3 depends on node 5
But because the final rule is not "valuable" (again based on the same metric), I will not add the rule to my system.
2. Execute the request order
Once I built a dependency graph, execute the list in an order that maximizes the final connectivity. I am not sure if this is a really a problem but I somehow have a feeling that there might exist more than one order in which case, it is required to choose the best order.
First and foremost, I am wondering if I constructed the problem correctly and if I should be aware of any corner cases. Secondly, is there a closely related algorithm that I can look at? Currently, I am thinking of something like Feedback Arc Set or the Secretary Problem but I am a little confused at the moment. Any suggestions?
PS: I am a little confused about the problem myself so please don't flame on me for that. If any clarifications are needed, I will try to update the question.
It looks like you are trying to determine an ordering on requests you send to nodes with dependencies (or "partial ordering" for google) between nodes.
If you google "partial order dependency graph", you get a link to here, which should give you enough information to figure out a good solution.
In general, you want to sort the nodes in such a way that nodes come after their dependencies; AKA topological sort.
I'm a bit confused by your ordering constraints vs. the graphs that you picture: nothing matches up. That said, it sounds like you have soft ordering constraints (A should come before B, but doesn't have to) with costs for violating the constraint. An optimal algorithm for scheduling that is NP-hard, but I bet you could get a pretty good schedule using a DFS biased towards large-weight edges, then deleting all the back edges.
If you know in advance the dependencies of each node, you can easily build layers.
It's amusing, but I faced the very same problem when organizing... the compilation of the different modules of my application :)
The idea is simple:
def buildLayers(nodes):
layers = []
n = nodes[:] # copy the list
while not len(n) == 0:
layer = _buildRec(layers, n)
if len(layer) == 0: raise RuntimeError('Cyclic Dependency')
for l in layer: n.remove(l)
layers.append(layer)
return layers
def _buildRec(layers, nodes):
"""Build the next layer by selecting nodes whose dependencies
already appear in `layers`
"""
result = []
for n in nodes:
if n.dependencies in flatten(layers): result.append(n) # not truly python
return result
Then you can pop the layers one at a time, and each time you'll be able to send the request to each of the nodes of this layer in parallel.
If you keep a set of the already selected nodes and the dependencies are also represented as a set the check is more efficient. Other implementations would use event propagations to avoid all those nested loops...
Notice in the worst case you have O(n3), but I only had some thirty components and there are not THAT related :p

Resources