Com port queue latency metering - algorithm

I have two programs (host and slave) communicating over a com port. In the simplest case, the host sends a command to the slave and waits for a response, then does it again. But this means that each side has to wait for the other for every transaction. So I use a queue so the second command can be sent before the first response comes back. This keeps things flowing faster.
But I need a way of metering the use of the queue so that there are never more than N command/response pairs in flight at any time. For example, if N is 3, I will wait to send the fourth command until I get the first response back, and so on. It must also keep track of which response goes with which command.
One thought I had is to tag each command with an integer modulo counter which is also returned with the response. This ensures that commands and responses are always paired correctly, and a modulo compare lets me keep the commands no more than N ahead of the responses.
What I am wondering is: is there a better way? Isn't this a fairly common thing to do?
(I am using Python, but that is not important.)

Using a sequence number and modulo arithmetic is in fact quite a common way to both acknowledge messages received and tell the sender when it can send more messages - see e.g. http://en.wikipedia.org/wiki/Sliding_window_protocol. Unfortunately for you, the obvious example, TCP, is unusual in that it uses a sequence number based on byte counts, not message counts, but the principle is much the same - TCP just has an extra degree of flexibility.
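For what it's worth, here is a minimal sketch of the sequence-number approach in Python, assuming a pair of helpers send_command(seq, cmd) and read_response() for the serial I/O (those names are made up for the example) and assuming the slave answers commands in the order it receives them:

from collections import deque

WINDOW = 3     # N: maximum command/response pairs in flight at once
MODULO = 256   # sequence numbers wrap here, so they fit in one byte

class MeteredLink:
    def __init__(self, send_command, read_response):
        self._send = send_command        # send_command(seq, cmd): write one frame
        self._recv = read_response       # read_response() -> (seq, response)
        self._next_seq = 0
        self._outstanding = deque()      # (seq, cmd) pairs awaiting a response

    def submit(self, cmd, on_response):
        # Never allow more than WINDOW unanswered commands: drain responses first.
        while len(self._outstanding) >= WINDOW:
            self._drain_one(on_response)
        seq = self._next_seq
        self._next_seq = (self._next_seq + 1) % MODULO
        self._outstanding.append((seq, cmd))
        self._send(seq, cmd)

    def _drain_one(self, on_response):
        seq, resp = self._recv()
        expected_seq, cmd = self._outstanding.popleft()
        if seq != expected_seq:
            # A mismatch means a frame was lost or duplicated on the wire.
            raise IOError("expected seq %d, got %d" % (expected_seq, seq))
        on_response(cmd, resp)           # hand the matched pair back to the caller

Because the slave is assumed to answer strictly in order, a plain FIFO of outstanding sequence numbers is enough to pair responses with commands; the modulo counter mainly serves to detect a dropped or duplicated frame, exactly like a small sliding window.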

Related

Why is it legit to take the next two commands to fill gaps between paxos events?

There is a point in the Paxos algorithm (http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) that I do not understand. It's about how to deal with gaps; the paper describes two ways, as below:
The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “no-op” command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137 of the consensus algorithm.) Once these no-op commands have been chosen, commands 138–140 can be executed.
That is, the two options are: (1) take the next two commands requested by clients, or (2) fill the gap with a special “no-op” command.
The second option is covered in “Why is it legit to use no-op to fill gaps between paxos events?” below.
My question is about the first one. In my opinion, taking the next two commands would violate consistency, since an instance that happens later may end up with a smaller sequence number. So why is it still legit?
Since all clients see the same consistent outcome, there isn't a violation of consistency, and so there isn't a violation of the algorithm's invariants.
If you consider the scenario where all the commands come from a single client, then it would be a reordering compared to the order in which the client sent the values. If a single client is multi-threaded and streams multiple concurrent requests, the reordering may be harmless (or not, depending on the application semantics). If instead the leader uses no-ops, then it effectively just drops some messages, which may not be harmless to a client that depends on the ordering of the values it streams. It depends on the application.
If you consider the scenario where the values come from different clients, the situation is far more natural. Under adverse conditions some reordering occurs, yet under normal running it doesn't happen. The reordering just looks like some values "took longer than normal" to be fixed by a leader while later values issued by other clients "ran faster".
The first option and the second option are effectively the same.
For example, say the client wants to write 4 commands:
a = 1
b = 2
c = 3
d = 4
With the first option, the result may be:
a = 1
e = 5
f = 6
d = 4
With the second option, the result is:
a = 1
noop
noop
d = 4
So both results are illegal from the client's point of view; in this problem there is no real difference between losing data and violating the order.
Then, as #simbo1905 said, Multi-Paxos does not promise FIFO order.
Suppose 136, 137, and 138 have an ordering relation, for example because they are sent over one TCP connection and the client pipelines the three commands. It is the client's responsibility to keep these operations in FIFO order. If the client has many outstanding commands and wants FIFO client order, it needs to retry from the first failed command onward.
The other scenario is that they are sent over different connections. Since they are sent over different connections, the server can't promise FIFO client order. Operations 136 and 137 have failed, and any outcome is acceptable: these two operations may end up succeeding or failing. If the client wants to know the result, it should retry the operation.
In both scenarios, it is the client's responsibility to guarantee the order, not the server's.
So I think you are misunderstanding the meaning of consistency: consistency has no relation to order. It is about safety and liveness.
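To make the client-side responsibility concrete, here is a minimal sketch (propose() is a made-up stand-in for submitting a value through the Paxos leader) of a client that preserves its own FIFO order by retrying from the first command that was not chosen:

def send_in_client_order(commands, propose):
    # propose(cmd) -> True if cmd was chosen for a slot, False otherwise
    # (hypothetical helper; a real client would talk to the current leader).
    for cmd in commands:
        while not propose(cmd):
            # Not chosen: the slot went to another client's value or the
            # message was lost. Keep retrying this command before sending
            # anything later, so the log never skips ahead of it.
            pass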

Why is it legit to use no-op to fill gaps between paxos events?

I am learning the Paxos algorithm (http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf) and there is one point I do not understand.
We know that events follow a sequential order, and it can happen that, say, events 1-5 and 10 are decided, but 6-9 and everything from 11 onward are not yet. The paper above says we simply fill the gap at 6-9 with no-op values and record new events from 11 on.
So in this case, since event 10 is already recorded, we know some events must have happened between 5 and 10 but were not recorded by Paxos due to some failure. If we simply fill in no-op values, those events will be lost from our record.
Even worse, if, as the paper I linked above says, events are in fact commands from the client, then missing a few commands in the middle might make the entire set of operations illegal (if none of the commands can be skipped, or if their order matters).
So why is it still legit for Paxos to fill gaps between events with no-op values, if the entire set of records might become invalid as a result? Also, is there a better way to recover from such gaps than using no-op?
This is a multi-part answer.
Proposing no-op values is the way to discover commands that haven't got to the node yet. We don't simply fill each slot in the gap with a no-op command: we propose each slot is filled with a no-op. If any of the peers have accepted a command already, it will return that command in the Prepare-ack message and the proposer will use that command in the Accept round instead of the no-op.
For example, assume a node was behind a temporary network partition and was unable to play with the others for slots 6-9. It knows it missed out upon learning the command in slot 10. It then proposes no-ops to learn what was decided in those slots.
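A rough sketch of that recovery step, with prepare() and accept() as made-up helpers standing in for phases 1 and 2 of the consensus instance for one slot:

NOOP = object()   # sentinel value for the no-op command

def fill_slot(slot, prepare, accept):
    # prepare(slot) -> the highest-numbered value some acceptor has already
    #                  accepted for this slot, or None if nobody has one.
    # accept(slot, value) -> runs phase 2 and returns the chosen value.
    already_accepted = prepare(slot)
    if already_accepted is not None:
        # A peer already accepted a real command: we are obliged to propose
        # that value, so nothing that may have been chosen is overwritten.
        return accept(slot, already_accepted)
    # No acceptor has a value for this slot: close the gap with a no-op.
    return accept(slot, NOOP)

# e.g. after learning the command in slot 10, recover the missing slots 6-9:
# for slot in range(6, 10):
#     fill_slot(slot, prepare, accept)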
Practical implementations also have an out-of-band learning protocol to learn lots of transitions in bulk.
A command isn't a command until it is fully decided; until then it is just a proposed command. Paxos is about choosing between contending commands from multiple clients. Clients must be prepared to have their commands rejected because another client's was chosen instead.
Practical implementations are all about choosing the order of client commands. Their world view is that of a write-ahead log, and they are placing the commands in that log. They retry in the next slot if their command wasn't chosen. (There are many ways to reduce the contention; Lamport mentions forwarding requests to a leader, as is done in Multi-Paxos.)
Practical systems also have some means to know if the command is invalid before proposing it; such as knowing a set of reads and a set of writes. This is important for two reasons. First, it's an asynchronous, multi-client system and anything could have changed by the time the client's command has reached the server. Second, if two concurrent commands do not conflict then both should be able to succeed.
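As a sketch of that second point, a command carrying explicit read and write sets can be checked for conflicts before it is applied; the Command class below is invented purely for illustration:

class Command:
    def __init__(self, reads, writes):
        self.reads = set(reads)     # keys the command reads
        self.writes = set(writes)   # keys the command writes

def conflicts(a, b):
    # Two commands conflict if either one writes something the other
    # reads or writes; otherwise they commute and can both succeed.
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & (a.reads | a.writes))

# Two concurrent commands on disjoint keys do not conflict:
transfer = Command(reads={"acct:1", "acct:2"}, writes={"acct:1", "acct:2"})
rename   = Command(reads={"user:7"}, writes={"user:7"})
assert not conflicts(transfer, rename)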
The system model allows commands (messages) to be lost by the network anyway. If a message is lost, the client is expected to eventually retry the request, so it is fine to drop some of them. If the commands of a client have to be executed in client order, then either the client only sends commands synchronously, or the commands have to be ordered at a higher level in the library and kept in some client-session object before being executed.
AFAIK the Zab protocol guarantees client-order, if you don't want to implement that at a higher level.

Measure RTT of WebsocketServer: Single Message or bulk of messages?

I recently changed something in the WebsocketServer implementation of one of my projects that (presumably) decreases the round-trip time (RTT) of its messages. I want to measure the RTT of both implementations to compare them. For this, I am sending m messages of s bytes from c different connections.
What I am wondering now is: should the RTT be measured for each message separately (remembering the sending time of each message and when its response arrives), or should I only remember the time the first message was sent and the time the response to the last message arrives? Which one is more accurate?
I would probably go for the first approach; what made me wonder is that websocket-benchmark seems to use the latter. Is this just an oversight, or is there reasoning behind it?
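For reference, here is roughly what the two approaches look like with the third-party websockets package (the echo behaviour of the server, the URI, and the payload are assumptions; the per-message version is also simplified to a synchronous ping-pong rather than c concurrent connections):

import asyncio
import time
import websockets   # pip install websockets (assumed available)

async def per_message_rtt(uri, n_messages, payload):
    # Approach 1: timestamp every message and its own response.
    rtts = []
    async with websockets.connect(uri) as ws:
        for _ in range(n_messages):
            t0 = time.perf_counter()
            await ws.send(payload)
            await ws.recv()                      # assumes the server echoes it back
            rtts.append(time.perf_counter() - t0)
    return rtts                                  # full distribution: mean, p99, ...

async def bulk_rtt(uri, n_messages, payload):
    # Approach 2: one timestamp before the first send, one after the last
    # response; pipelining and queueing are folded into a single number.
    async with websockets.connect(uri) as ws:
        t0 = time.perf_counter()
        for _ in range(n_messages):
            await ws.send(payload)
        for _ in range(n_messages):
            await ws.recv()
        return (time.perf_counter() - t0) / n_messages

# asyncio.run(per_message_rtt("ws://localhost:8765", 1000, b"x" * 125))

The first version gives you a latency distribution (so you can compare medians and tail percentiles between the two implementations); the second mostly measures throughput and hides per-message variation.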

How do udp sockets actually work internally?

I am trying to reduce packet manipulation to a minimum in order to improve the efficiency of a specific program I am working on, but I am struggling with the time it takes to send through a UDP socket using sendto/recvfrom. I am using 2 very basic processes (applications): one sends, the other receives.
I would like to understand how Linux works internally when using these function calls...
Here are my observations when sending packets at different rates:
10Kbps, the time it takes for the messages to go from one application to the other is about 28us
400Kbps, the time it takes for the messages to go from one application to the other is about 25us
4Mbps, the time it takes for the messages to go from one application to the other is about 20us
40Mbps, the time it takes for the messages to go from one application to the other is about 18us
When using different CPUs, the times are obviously different, but consistent with those observations. There must be some sort of setting that lets queue reads happen faster depending on the traffic flow on a socket... how can that be controlled?
When using a node as a forwarding node only, going in and out takes about 8us with a 400Kbps flow; I want to converge to that value as much as I can. 25us is not acceptable and is deemed too slow (it is obviously far less than the delay between packets anyway... but the point is to eventually process a much greater number of packets, hence this time needs to be shortened!). Is there anything faster than sendto/recvfrom? (I must use 2 different applications (processes); I know I cannot use a monolithic block, so the information has to be sent over a socket.)
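For context, a bare-bones version of this kind of measurement in Python, with the send timestamp carried inside the packet so the receiving process can compute the application-to-application delay (both processes are assumed to run on the same host, so they share the same wall clock):

import socket
import struct
import time

ADDR = ("127.0.0.1", 9999)

def sender(n_packets, interval_s):
    # Paces n_packets onto the socket, each carrying its send timestamp.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for _ in range(n_packets):
        sock.sendto(struct.pack("!d", time.time()), ADDR)
        time.sleep(interval_s)

def receiver(n_packets):
    # Prints the per-packet delay in microseconds as packets arrive.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(ADDR)
    for _ in range(n_packets):
        data, _ = sock.recvfrom(64)
        (t_sent,) = struct.unpack("!d", data)
        print("%.1f us" % ((time.time() - t_sent) * 1e6))

# Run receiver(10000) in one process first (so it binds the port),
# then sender(10000, 0.001) in another.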

Unclear on Nagle's Algorithm

I've been doing some research on Nagle's algorithm out of idle curiosity. I understand the basic concept behind it (TCP packets contain a significant amount of overhead, especially when dealing with small payloads), but I'm not sure I grok the implementation.
I was reading this article on Wikipedia, but I'm still unclear on how it works. Let's take the example of a Telnet connection. The connection is established and I begin typing. Let's say I type three characters (cat, for example) and hit return. Now we're talking cat\r\n which is still only 5 bytes. I'd think this would not get sent until we queue up enough bytes to send - and yet, it does get sent immediately (from a user perspective), since cat is immediately executed upon hitting return.
I think I have a fundamental misunderstanding here on how the algorithm works, specifically regarding the bit where "if there is unconfirmed data still in the pipe, enqueue, else send immediately."
The data gets sent immediately only if everything you sent previously has already been acknowledged (or this is your first write in the session). So, as the server gets busier and slower to acknowledge, the data gets queued up, to a maximum of a full segment, before being sent, in order to avoid swamping it with lots of small packets.
So whether data gets sent immediately or not can only be determined in the context of previous messages, if any.
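Roughly, the sender-side rule can be sketched like this (this is only a schematic of the logic, not the actual kernel implementation):

def nagle_may_send_now(unacked_bytes_in_flight, buffered_bytes, mss):
    # Send right away if the peer has acknowledged everything sent so far
    # (e.g. the very first keystroke on an idle Telnet connection) ...
    if unacked_bytes_in_flight == 0:
        return True
    # ... or if a full segment's worth of data has accumulated.
    if buffered_bytes >= mss:
        return True
    # Otherwise hold the small write until an ACK arrives or the buffer fills.
    return False

# Applications that need every small write flushed immediately typically
# disable the algorithm per socket:
#   sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

So in the Telnet example the 'c' goes out immediately; whether 'a', 't', and the newline ride in the same segment depends on how quickly the ACK for the previous bytes comes back, which is why it still feels instantaneous to the user.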
Read this post; it is quite in-depth and clarified a lot of things for me.
