What if log replication responses arrive out of order in etcd raft?

I'm a newbie with etcd and have some points of confusion about log replication:
For example, the leader sends out {term:2,index:3} and then {term:2,index:4}, and the majority respond in order too. But due to network delay, the leader receives the responses out of order and gets the response for {term:2,index:4} first.
How does etcd handle such a case? It seems like it just ignores the log {term:2,index:3} and commits {term:2,index:4} directly.
func (pr *Progress) MaybeUpdate(n uint64) bool {
	var updated bool
	if pr.Match < n {
		pr.Match = n
		updated = true
		pr.ProbeAcked()
	}
	pr.Next = max(pr.Next, n+1)
	return updated
}
How does etcd retry when a response packet (e.g. the response for {term:2,index:3}) is lost? I can't find any code in the etcd project that handles this.

The questions you ask are more raft-related than etcd-related (etcd implements raft, so they are still relevant though). To get a high-level understanding of the raft algorithm, I highly recommend you check out the raft webpage and the raft paper (it's really nicely written!). I believe that section 5.3, "Log replication", will be helpful.
First, let's lay some foundation: the leader keeps track of the entries matching on every follower. The paper keeps this information in nextIndex[] and matchIndex[] (check Fig. 2); etcd keeps it in a ProgressMap.
// ProgressMap is a map of *Progress.
type ProgressMap map[uint64]*Progress

type Progress struct {
	Match, Next uint64
	...
}
Now let's jump to your questions.
For example, the leader sends out {term:2,index:3} and then {term:2,index:4}, and the majority respond in order too. But due to network delay, the leader receives the responses out of order and gets the response for {term:2,index:4} first. How does etcd handle such a case? It seems like it just ignores the log {term:2,index:3} and commits {term:2,index:4} directly.
Here all depends on state of the follower (from leader perspective). Let's dive into StateProbe and StateReplicate.
In StateProbe the leader tries to figure out which entries it should send to the follower. It sends one message at a time and waits for the response (which might be a reject response, in which case the leader has to decrease the Next tracked for this follower and retry). In this state, sending 2 different MsgApp messages to the same follower is not possible.
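To make the probing concrete, here is a toy model of that loop (my own simplification, not etcd's actual code; real etcd also uses the reject hint carried in the response to back off faster than one index at a time):

package main

import "fmt"

// Toy model of StateProbe: the leader sends one probe at a time and
// decrements Next on every reject until the follower accepts.
func main() {
	followerMatch := uint64(2) // follower's log agrees with the leader up to index 2
	next := uint64(5)          // leader's optimistic guess for the next index to send

	for {
		probe := next - 1 // index the follower must already hold for the append to fit
		if probe > followerMatch {
			fmt.Printf("probe at index %d rejected; decreasing Next\n", probe)
			next--
			continue
		}
		fmt.Printf("probe at index %d accepted; switching to StateReplicate\n", probe)
		break
	}
}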
In StateReplicate the leader assumes that the network is stable and sends (potentially) many MsgApp messages. Let's work through an example.
Match := 2, Next := 2
Follower log: [E1, E2] (E stands for "entry")
Leader log:   [E1, E2]
In this state the leader gets a put request for entries E3, E4 and E5. Let's assume that the max batch size is 2 and thus all new entries can't be sent in a single message. The leader will send 2 messages: (Index: 3, Entries: [E3, E4]) and (Index: 5, Entries: [E5]). The second message will be sent before the ack for the first one is received. In this case, the follower gets the first message, checks whether it can append it by using the Index from the request (the check is performed in (raft).handleAppendEntries > (raftLog).maybeAppend > (raftLog).matchTerm > (raftLog).term), appends the entries to its log and sends an ack. Later on, the follower gets the 2nd request and does the same for it (checks whether it can append it and sends an ack).
The fact that the follower checks whether it can append the entries before sending the ack is important here. Once the leader gets an ack for any message, it is sure that all entries up to Index + len(Entries) are populated in the follower's log (otherwise the message would have been rejected instead of acked). Thanks to that, it does not matter if the first ack is delayed (or even lost).
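You can see this directly in the MaybeUpdate code quoted in the question: Match only ever moves forward, so a late (or lost) ack for a smaller index is simply a no-op. A minimal runnable sketch (a stripped-down copy of Progress, without ProbeAcked, replaying the indexes from the question):

package main

import "fmt"

// Stripped-down copy of etcd's Progress with the same MaybeUpdate logic.
type Progress struct {
	Match, Next uint64
}

func (pr *Progress) MaybeUpdate(n uint64) bool {
	var updated bool
	if pr.Match < n {
		pr.Match = n // Match only ever moves forward
		updated = true
	}
	if pr.Next < n+1 {
		pr.Next = n + 1
	}
	return updated
}

func main() {
	pr := &Progress{Match: 2, Next: 5}
	// The ack for {term:2,index:4} arrives first...
	fmt.Println(pr.MaybeUpdate(4), pr.Match) // true 4
	// ...and the delayed ack for {term:2,index:3} is then a no-op.
	fmt.Println(pr.MaybeUpdate(3), pr.Match) // false 4
}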
How does etcd retry when a response packet (e.g. the response for {term:2,index:3}) is lost? I can't find any code in the etcd project that handles this.
I'll focus on etcd now, as the raft paper only describes this as "the leader retries AppendEntries RPCs indefinitely", which is rather non-constructive. Every short interval, the leader sends MsgHeartbeat to the follower and the latter responds with MsgHeartbeatResp. As part of handling MsgHeartbeatResp, the leader does the following:
if pr.Match < r.raftLog.lastIndex() {
	r.sendAppend(m.From)
}
This should be read as: "if there is any entry that is not present on the follower, send it the first missing entry". It can be seen as a retry mechanism, since pr.Match will not increase without an ack from the follower.
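Here is a minimal model of that retry path (hypothetical names, heavily simplified; in etcd the real logic lives in the leader's handling of MsgHeartbeatResp):

package main

import "fmt"

// leaderView models what the leader tracks for a single follower.
type leaderView struct {
	lastIndex uint64 // index of the last entry in the leader's log
	match     uint64 // highest index known to be on the follower (pr.Match)
	next      uint64 // next index to send to the follower (pr.Next)
}

// onHeartbeatResp mirrors the snippet above: every heartbeat response
// is an opportunity to resend whatever the follower is missing.
func (l *leaderView) onHeartbeatResp() {
	if l.match < l.lastIndex {
		fmt.Printf("follower behind (Match=%d < lastIndex=%d); resending from index %d\n",
			l.match, l.lastIndex, l.next)
	}
}

func main() {
	// The ack for index 3 was lost, so Match is stuck at 2. The next
	// heartbeat response triggers the resend; no per-message retry
	// timer is needed.
	l := &leaderView{lastIndex: 4, match: 2, next: 3}
	l.onHeartbeatResp()
}

So the "retry" piggybacks on the regular heartbeat traffic instead of a dedicated timer per message.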

Related

How to create a shared queue in Go?

I am trying to implement the least-connections algorithm for a load balancer. I am using a priority queue to keep the count of connections per server in sorted order.
Here is the code:
server = spq[0]
serverNumber = server.value
updatedPriority = server.priority + 1 // Increment connection count for server
spq.update(server, serverNumber, updatedPriority)

targetUrl, err := url.Parse(configuration.Servers[serverNumber])
if err != nil {
	log.Fatal(err)
}

// Send the request to the selected server
httputil.NewSingleHostReverseProxy(targetUrl).ServeHTTP(w, r)

updatedPriority = server.priority - 1 // Decrement connection count for server
spq.update(server, serverNumber, updatedPriority)
where spq is my priority queue.
This code runs for every request the balancer receives.
But I am not getting correct results when logging the state of the queue for every request.
For example, in one case the queue contained the same server twice with different priorities.
I am sure this has something to do with synchronising and locking the queue across requests, but I am not sure what the correct approach is in this particular case.
If this is really your code that runs in multiple goroutines, then you clearly have a race.
I do not understand spq.update. At first it looks like a function that reorders the queue to keep the server with the minimum number of connections at element 0, but then why does it need both server and serverNumber? serverNumber appears to be a unique ID for the server, and since you already have server, why do you need it?
In any case, you should have a sync.Mutex shared by all goroutines, lock the mutex before the first line, and unlock it after spq.update; you should also lock it again after the proxy call, and unlock it when you are done, as in the sketch below. The line that subtracts 1 from server.priority will only work if server is a pointer. If it is not a pointer, you're losing all the updates to server that happened during the call.
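A minimal sketch of that locking scheme (to keep it self-contained, the heap-based spq is replaced by a plain slice of connection counts; all names here are mine, not the asker's):

package main

import (
	"fmt"
	"sync"
)

type balancer struct {
	mu    sync.Mutex
	conns []int // conns[i] = active connections on server i
}

// acquire picks the least-loaded server and increments its count,
// all under one lock, so concurrent requests cannot race.
func (b *balancer) acquire() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	least := 0
	for i, c := range b.conns {
		if c < b.conns[least] {
			least = i
		}
	}
	b.conns[least]++
	return least
}

// release decrements the count under the same lock once the proxied
// request is done.
func (b *balancer) release(i int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.conns[i]--
}

func main() {
	b := &balancer{conns: make([]int, 3)}
	var wg sync.WaitGroup
	for r := 0; r < 9; r++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			i := b.acquire()
			// ... proxy the request here, without holding the lock ...
			b.release(i)
		}()
	}
	wg.Wait()
	fmt.Println(b.conns) // all back to 0: [0 0 0]
}

Note that the proxy call itself happens outside the critical section; holding the mutex across ServeHTTP would serialize every request through the balancer.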

Does an OMNET++ / Veins simulation get very slow if both Vehicles and RSUs broadcast messages periodically?

Let me give a brief context first:
I have a scenario where the RSUs broadcast a fixed message 'RSUmessage' about every TRSU seconds. I have implemented the following code for the RSU broadcast (these fixed messages have Psid = -100 to differentiate them from others):
void TraCIDemoRSU11p::handleSelfMsg(cMessage* msg) {
    if (WaveShortMessage* wsm = dynamic_cast<WaveShortMessage*>(msg)) {
        if (wsm->getPsid() == -100) {
            sendDown(RSUmessage->dup());
            scheduleAt(simTime() + trsu + uniform(0.02, 0.05), RSUmessage);
        }
    }
    else {
        BaseWaveApplLayer::handleSelfMsg(wsm);
    }
}
A car can receive these messages from other cars as well as from RSUs. RSUs discard the messages received from cars. The cars receive multiple such messages, do some comparison work, and periodically broadcast a similar type of message, 'aggregatedMessage', every interval Tcar. aggregatedMessage also has Psid = -100 so that it can easily be differentiated from other messages.
I am scheduling the car events using self-messages (though I believe it could have been done inside handlePositionUpdate). The handleSelfMsg of a car is the following:
void TraCIDemo11p::handleSelfMsg(cMessage* msg) {
    if (WaveShortMessage* wsm = dynamic_cast<WaveShortMessage*>(msg)) {
        wsm->setSerial(wsm->getSerial() + 1);
        if (wsm->getPsid() == -100) {
            sendDown(aggregatedMessage->dup());
            //sendDelayedDown(aggregatedMessage->dup(), simTime()+uniform(0.1,0.5));
            scheduleAt(simTime() + tcar + uniform(0.01, 0.05), aggregatedMessage);
        }
        //send this message on the service channel until the counter is 3 or higher.
        //this code only runs when channel switching is enabled
        else if (wsm->getSerial() >= 3) {
            //stop service advertisements
            stopService();
            delete(wsm);
        }
        else {
            scheduleAt(simTime() + 1, wsm);
        }
    }
    else {
        BaseWaveApplLayer::handleSelfMsg(msg);
    }
}
PROBLEM: With this setting, the simulation is very, very slow. I get about 50 simulated seconds in 5-6 hours or more in Express mode in the OMNeT++ GUI (number of RSUs: 64, number of vehicles: 40, on a roughly 1 km x 1 km map).
Also, I am referring to this post. The OP says that he got faster speeds by removing the sending of a message after each RSU received one. In my case I cannot remove that, because I need to send out the broadcast messages after each interval.
Question: I think this slowness occurs because every node tries to sendDown messages at the beginning of each simulated second. Is it the case that OMNeT++ slows down when all vehicles and nodes schedule and send messages at the same time? (It makes sense that it would slow down, but by what degree?) There are only around 100 nodes overall in the simulation; surely it cannot be this slow.
What I tried: I used sendDelayedDown(wsm->dup(), simTime()+uniform(0.1,0.5)); to spread the sending of the messages throughout the first half of each simulated second. This seems to stop messages piling up at the beginning of each simulated second and sped things up a bit, but not by much overall.
Can anybody please let me know whether this is normal behavior or whether I am doing something wrong?
I also apologize for the long post; I could not explain my problem without giving the context.
It seems you are flooding your network with messages: every message from an RSU gets duplicated and transmitted again by every car that receives it. Hence, the computational time increases quadratically with the number of nodes (senders of messages) in your network (every sent message has to be handled by every node in range to receive it). The limit of 3 transmissions per message does not seem to help much and, as the comment in the code indicates, is not used at all if there is no channel switching.
Therefore, if you cannot improve/change your code to simply send fewer messages, you have to live with that. Your little tweak of sending the messages in a delayed manner only distributes the messages over one second but does not solve the flooding problem.
However, there are still some hints you can follow to improve the performance of your simulation:
Compile in release mode: make MODE=Release
Run your simulation in the terminal environment Cmdenv: ./run -u Cmdenv ...
If you absolutely need to use the graphical environment, you should at least speed up the animation by using the slider in the upper part of the interface.
Removing the simtime-resolution parameter from the omnetpp.ini file solves the problem.
It seems the simulation kernel has an issue when the channel delay does not match the simulation-time resolution.
You can verify the solution by cloning the following repository. Note that you need a functional installation of the OMNeT++ framework; specifically, I tested this fix with OMNeT++ 5.6.2.
https://github.com/Ryuuba/flooding

What policy is vehicle rerouting based on in case of an accident?

I'm working on a scenario where there is one route at the beginning, which then splits into three and merges into one again. I'm stopping the car in the first lane; the second car stops as well, the third one reroutes to the middle lane, the fourth joins the queue, the fifth reroutes, the sixth joins the queue, the seventh reroutes... Why is it behaving like that? I can't understand where to change this "balancing" or "threshold". Any suggestions?
The Veins 4.6 sample application behaves as follows:
if a vehicle is stopped (and has been driving) and has not yet sent a warning message, it sends a message containing its current road (TraCIDemo11p.cc, line 82)
if a vehicle receives a warning message, it tries to find a new route through the road network that avoids this road (TraCIDemo11p.cc, line 48); in addition, if it has not yet sent a warning message, it re-sends the received message (TraCIDemo11p.cc, line 54)
So, if you use this sample application and observe that some vehicles do not change their route, this can be attributed (among other causes) to either of the following effects:
they never received a message that told them about the congestion
they received a message, but couldn't find a route through the road network that would avoid the congestion
Why a vehicle did not receive a message can, again, be attributed (among other causes) to any of the following:
a warning message was transmitted to them, but they could not receive it (e.g., due to interference)
a warning message was transmitted through the network before they started their trip (remember, warning messages are not repeated by the example application)

Non-overtaking Property of MPI messages in the presence of MPI_ANY_SOURCE and MPI_ANY_TAG

Suppose I have one server that processes several kinds of requests, identified by different MPI tags.
while (!stop)
{
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &ready, &status);
    if (ready)
    {
        src = status.MPI_SOURCE;
        switch (status.MPI_TAG)
        {
        case MPI_REQ_A:
            MPI_Irecv(...);
            // do something else useful while waiting
            MPI_Wait(...);
            break;
        case MPI_REQ_B:
            MPI_Irecv(...);
            // do something else useful while waiting
            MPI_Wait(...);
            break;
        }
    }
    // do something else before calling MPI_Iprobe() again
}
And there are N clients that send requests to the server.
Each client sends messages as in the following code:
for (int i = 0; i < k; i++)
{
    MPI_Send(...., MPI_REQ_A, ..);
    MPI_Send(...., MPI_REQ_B, ..);
}
If MPI_Probe() or MPI_Iprobe() is used with MPI_ANY_SOURCE and MPI_ANY_TAG on the receiving side as in the code above:
Question 1) Is it still guaranteed that the message of type A sent in iteration i is received before the message of the same type sent in iteration i+1 by the same client?
Question 2) Is it still guaranteed that the message of type A sent in iteration i is received before the message of type B sent in the same iteration i by the same client?
According to this:
If a sender sends two messages in succession to the same destination,
and both match the same receive, then this operation cannot receive
the second message if the first one is still pending.
Now, if you unroll that loop in the client, it basically does this:
MPI_Send(....,MPI_REQ_A,..); // i
MPI_Send(....,MPI_REQ_B,..); // i
MPI_Send(....,MPI_REQ_A,..); // i + 1
MPI_Send(....,MPI_REQ_B,..); // i + 1
MPI_Send(....,MPI_REQ_A,..); // i + 2
MPI_Send(....,MPI_REQ_B,..); // i + 2
...
Since the server's receives match all of those sends, the messages are guaranteed to be received in that exact order for that client. So the answer to both your questions is yes: for any single client, messages are delivered in the order they are posted.
On a side note, there is no requirement as to the order in which messages from different clients are received. For instance, the following scenario is perfectly legal:
The server posts its first receive
All clients post their first send and block
The server matches its receive to client X's first message, and posts its second receive
Client X's first send completes, and it posts its second send
Meanwhile, the server matches its second receive to client Y's first send and posts its third receive
Client Y's first send completes, and it posts its second send
At this point, the server is free to match its third receive to client X's second send instead of any of the other clients' first sends
And so forth: the server can alternate between accepting sends from clients X and Y only, leaving the others starved.
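If it helps, here is a Go analogy of those two guarantees (this is not MPI code, just an illustration): each client is a FIFO channel, so per-client order is fixed, while select, like a receive with MPI_ANY_SOURCE, is free to pick whichever ready client it likes next.

package main

import "fmt"

func main() {
	// Each client is its own FIFO channel: per-client order is fixed.
	clientX := make(chan string, 4)
	clientY := make(chan string, 4)
	for i := 0; i < 2; i++ {
		clientX <- fmt.Sprintf("X: REQ_A(i=%d)", i)
		clientX <- fmt.Sprintf("X: REQ_B(i=%d)", i)
		clientY <- fmt.Sprintf("Y: REQ_A(i=%d)", i)
		clientY <- fmt.Sprintf("Y: REQ_B(i=%d)", i)
	}
	close(clientX)
	close(clientY)

	// Like MPI_ANY_SOURCE: select may pick either ready client next,
	// but it never reorders messages within one channel.
	for clientX != nil || clientY != nil {
		select {
		case m, ok := <-clientX:
			if !ok {
				clientX = nil
				continue
			}
			fmt.Println(m)
		case m, ok := <-clientY:
			if !ok {
				clientY = nil
				continue
			}
			fmt.Println(m)
		}
	}
}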

Publisher finishes before subscriber and messages are lost - why?

I'm fairly new to ZeroMQ and trying to get a basic pub/sub to work. When I run the following (the sub started before the pub), the publisher finishes but the subscriber hangs, having not received all the messages. Why?
I think the socket is being closed even though the messages have been sent? Is there a way of ensuring all messages are received?
Publisher:
import zmq
import random
import time
import tnetstring

context = zmq.Context()
socket = context.socket(zmq.PUB)
socket.bind("tcp://*:5556")

y = 0
for x in xrange(5000):
    st = random.randrange(1, 10)
    data = []
    data.append(random.randrange(1, 100000))
    data.append(int(time.time()))
    data.append(random.uniform(1.0, 10.0))
    s = tnetstring.dumps(data)
    print 'Sending ...%d %s' % (st, s)
    socket.send("%d %s" % (st, s))
    print "Messages sent: %d" % x
    y += 1

print '*** SERVER FINISHED. # MESSAGES SENT = ' + str(y)
Subscriber:
import sys
import zmq
import tnetstring

# Socket to talk to server
context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.connect("tcp://localhost:5556")

filter = ""  # get all messages
socket.setsockopt(zmq.SUBSCRIBE, filter)

x = 0
while True:
    topic, data = socket.recv().split()
    print "Topic: %s, Data = %s. Total # Messages = %d" % (topic, data, x)
    x += 1
In ZeroMQ, clients and servers always try to reconnect; they won't go down if the other side disconnects (because in many cases you'd want them to resume talking if the other side comes up again). So in your test code, the client will just wait until the server starts sending messages again, unless you stop recv()ing messages at some point.
In your specific instance, you may want to investigate using socket.close() and context.term(); the latter will block until all the messages have been sent. You also have the slow-joiner problem: you can add a sleep after the bind, but before you start publishing. This works in a test case, but you will want to really understand what is a solution versus a band-aid.
You need to think of the PUB/SUB pattern like a radio. The sender and receiver are both asynchronous. The Publisher will continue to send even if no one is listening. The subscriber will only receive data if it is listening. If the network goes down in the middle, the data will be lost.
You need to understand this in order to design your messages. For example, if you design your messages to be "idempotent", it doesn't matter if you lose data. An example of this would be a status-type message: it doesn't matter if you missed any of the previous statuses, because the latest one is correct and message loss doesn't matter. The benefit of this approach is that you end up with a more robust and performant system. The downside is that you can't always design your messages this way.
Your example includes a type of message that tolerates no loss. Another type of message would be transactional: for example, if you only sent the deltas of what changed in your system, you would not be able to lose messages. Database replication is often managed this way, which is why db replication is often so fragile. To try to provide guarantees, you need to do a couple of things. One is to add a persistent cache: each message sent needs to be logged in the persistent cache. Each message needs to be assigned a unique id (preferably a sequence number) so that the clients can determine whether they are missing a message. A second socket (ROUTER/REQ) needs to be added for the client to request the missing messages individually. Alternatively, you could just use the secondary socket to request resending over the PUB/SUB socket; the clients would then all receive the messages again (which works for the multicast version) and would ignore the messages they had already seen. NOTE: this follows the MAJORDOMO pattern found in the ZeroMQ guide.
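A minimal sketch of the sequence-number part of that scheme (library-free; the wire format and helper names are mine): the publisher stamps every message, and the subscriber detects gaps, which tells it exactly which messages to re-request over the side channel.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stamp prefixes a payload with its sequence number.
func stamp(seq uint64, payload string) string {
	return strconv.FormatUint(seq, 10) + " " + payload
}

// missingBetween returns the sequence numbers skipped between the last
// message seen and the one just received.
func missingBetween(last, got uint64) []uint64 {
	var missing []uint64
	for s := last + 1; s < got; s++ {
		missing = append(missing, s)
	}
	return missing
}

func main() {
	// The publisher sent 0..4; messages 2 and 3 were lost in transit.
	received := []string{stamp(0, "a"), stamp(1, "b"), stamp(4, "e")}

	last, first := uint64(0), true
	for _, msg := range received {
		parts := strings.SplitN(msg, " ", 2)
		seq, _ := strconv.ParseUint(parts[0], 10, 64)
		if !first {
			for _, m := range missingBetween(last, seq) {
				fmt.Println("gap detected, re-request seq:", m) // prints 2, then 3
			}
		}
		last, first = seq, false
	}
}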
An alternative approach is to create your own broker using ROUTER/DEALER sockets. When the ROUTER socket sees a DEALER connect, it stores that client's ID. When the ROUTER needs to send data, it iterates over all client IDs and sends the message to each. Each message should contain a sequence number so that a client knows which missing messages to request. NOTE: this is a sort of reimplementation of Kafka from LinkedIn.
