WebSocket frame fragmentation in an API - websocket

Would exposing WebSocket fragmentation have any value in a client-side API?
Reading RFC 6455, I became convinced that a non-continuation frame doesn't guarantee anything in terms of its semantics. One shouldn't rely on frame boundaries; it's just too risky. The spec addresses this explicitly:
Unless specified otherwise by an extension, frames have no semantic
meaning. An intermediary might coalesce and/or split frames, if no
extensions were negotiated by the client and the server or if some
extensions were negotiated, but the intermediary understood all the
extensions negotiated and knows how to coalesce and/or split frames
in the presence of these extensions. One implication of this is that
in absence of extensions, senders and receivers must not depend on
the presence of specific frame boundaries.
Thus receiving a non-continuation frame of type Binary or Text doesn't mean it is something atomic and meaningful that was sent from the other side of the channel. Similarly, a sequence of continuation frames doesn't mean that coalescing them will yield a meaningful message. And what's even more upsetting,
a single non-continuation frame may be the result of coalescing many other frames.
To sum up, groups of bytes sent over the WebSocket may be received regrouped in pretty much any way, as long as the byte order is preserved (all of this in the absence of extensions).
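To make that concrete, the only safe receiver behaviour is roughly the following (a minimal C++ sketch with a hypothetical Frame type, not any particular library's API): accumulate payloads until a frame with the FIN bit set arrives, and only then treat the concatenation as a message.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical decoded frame -- not any particular library's type.
struct Frame {
    bool fin;                      // FIN bit: last frame of the message
    uint8_t opcode;                // 0x0 continuation, 0x1 text, 0x2 binary
    std::vector<uint8_t> payload;  // unmasked payload data
};

// Accumulates data frames until FIN; only the coalesced result is a message.
// Control frames (ping/pong/close) are ignored here for brevity.
class MessageAssembler {
public:
    // Returns true when a complete message is available in `message`.
    // Frame boundaries inside the message are deliberately discarded.
    bool on_data_frame(const Frame& f, std::vector<uint8_t>& message) {
        if (f.opcode != 0x0) opcode_ = f.opcode;   // first frame of a message
        buffer_.insert(buffer_.end(), f.payload.begin(), f.payload.end());
        if (!f.fin) return false;                  // more fragments to come
        message.swap(buffer_);
        buffer_.clear();
        return true;
    }
    uint8_t message_opcode() const { return opcode_; }  // 0x1 text or 0x2 binary

private:
    std::vector<uint8_t> buffer_;
    uint8_t opcode_ = 0;
};
```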
If so, then is it useful to introduce this concept at all? Maybe it's better to hide it as an implementation detail? I wonder whether WebSocket users have found it useful in products like Netty, Jetty, Grizzly, etc. Thanks.

Fragmentation is not a boundary for anything.
It's merely a way for the implementation to manage its own concerns: memory, WebSocket extensions, performance, etc.
A typical scenario would be a client endpoint sending text that is passed through the permessage-deflate extension. That extension compresses and generates fragments based on its deflate memory configuration, writing those fragments to the remote endpoint whenever it has a buffer of compressed data to write (some implementations will only write when the buffer is full or the message has received its final byte).
While exposing access to the fragments in an API has happened (Jetty has two core WebSocket APIs, both of which support fragment access), it's really only useful for those wanting lower-level control in streaming applications (think video/VoIP, where you want to stream with quality adjustments, dropping data if need be, not writing too fast, etc.).

There seems to be some ambiguity in the RFC concerning unfragmented messages, namely that they can be split or combined arbitrarily. But in the situation where a message is deliberately sent as multiple fragments (totalling X bytes), is it allowable for an intermediary to split some of these frames in a way that delivers a different number of bytes (than X) in the sequence? I don't think that is allowed, and fragmentation has some value in that respect. This is just from reading the RFC, as opposed to looking at real implementations.
The fragments of one message MUST NOT be interleaved between the
fragments of another message unless an extension has been
negotiated that can interpret the interleaving.
To my reading, this implies that unless some extension has been negotiated which allows it, fragments from different messages cannot be interleaved; so while the number of fragments can be altered, the exact number of bytes (and the bytes themselves) cannot be.

There should be support for controlling fragmentation. We have a C# program that intentionally splits a large WebSocket message into small fragments so that a small embedded processor receiving the data can process small chunks at a time. Instead, the data arrives completely coalesced into a single large block that consumes most of the available memory.
We are not sure where the coalescing is taking place; maybe in the C# library.
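To illustrate the intent only (a rough C++ sketch rather than the actual C# code, with a hypothetical write_frame callback standing in for whatever low-level frame write the library exposes; per the RFC, intermediaries may still regroup the fragments):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical primitive: write one frame with the given opcode, FIN bit and payload.
using FrameWriter =
    std::function<void(uint8_t opcode, bool fin, const uint8_t* data, size_t len)>;

// Send `payload` as a fragmented binary message in `chunk`-sized pieces so a
// small receiver can, in principle, process one chunk at a time.
void send_fragmented(const std::vector<uint8_t>& payload, size_t chunk,
                     const FrameWriter& write_frame) {
    size_t offset = 0;
    bool first = true;
    while (offset < payload.size()) {
        const size_t n = std::min(chunk, payload.size() - offset);
        const bool last = (offset + n == payload.size());
        // 0x2 = binary on the first frame, 0x0 = continuation afterwards.
        write_frame(first ? 0x2 : 0x0, last, payload.data() + offset, n);
        first = false;
        offset += n;
    }
}
```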

Related

Using NdisFIndicateReceiveNetBufferLists for every packet vs chaining them all together to receive?

I have an NDIS filter driver where I send received packets to a user-mode service; the service marks the packets that are OK (not malicious). I then iterate over the packets that are good to receive, convert each of them back into a proper NET_BUFFER_LIST with one NET_BUFFER, and indicate them one by one using NdisFIndicateReceiveNetBufferLists.
This caused a problem with large file transfers over SMB (copying files from shares): the transfer speed dropped significantly.
As a workaround, I now chain all of the NBLs that are OK together (instead of indicating them one by one) and then indicate all of them at once via NdisFIndicateReceiveNetBufferLists.
My question is: will this change cause any issue? Is there any difference between indicating X NBLs one by one vs chaining them together and indicating them all at once (since most of them might belong to different flows/apps)?
Also, the benefit of chaining packets together is much greater on the multi-packet receive path than on the multi-packet send path via FilterSendNetBufferLists. Why is that?
A NET_BUFFER represents a single network frame. (With some appropriate hand-waving for LSO/RSC.)
A NET_BUFFER_LIST is a collection of related NET_BUFFERs. The NET_BUFFERs on the same NET_BUFFER_LIST belong to the same "traffic flow" (more on that later), share the same metadata, and will have the same offloads performed on them. So we use the NET_BUFFER_LIST to group related packets and to have them share metadata.
The datapath generally operates on batches of multiple NET_BUFFER_LISTs. The entire batch is only grouped together for performance reasons; there's not a lot of implied relation between multiple NBLs within a batch. Exception: most datapath routines take a Flags parameter that can hold flags that make some claims about all the NBLs in a batch, for example, NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE.
So to summarize, you can indeed safely group multiple NET_BUFFER_LISTs into a single indication, and this is particularly important for perf. You can group unrelated NBLs together, if you like. However, if you are combining batches of NBLs, make sure you clear out any NDIS_XXX_FLAGS_SINGLE_XXX style flags. (Unless, of course, you know that the flags' promise still holds. For example, if you're combining 2 batches of NBLs that both had the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE flag, and if you verify that the first NBL in each batch has the same EtherType, then it is actually safe to preserve the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE flag.)
However note that you generally cannot combine multiple NET_BUFFERs into the same NET_BUFFER_LIST, unless you control the application that generated the payload and you know that the NET_BUFFERs' payloads belong to the same traffic flow. The exact semantics of a traffic flow are a little fuzzy down in the NDIS layer, but you can imagine it means that any NDIS-level hardware offload can safely treat each packet as the same. For example, an IP checksum offload needs to know that each packet has the same pseudo-header. If all the packets belong to the same TCP or UDP socket, then they can be treated as the same flow.
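As a sketch of the combining step (the helper function and its parameters are invented for illustration; NET_BUFFER_LIST_NEXT_NBL, NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE and NdisFIndicateReceiveNetBufferLists are the real NDIS names; resource/return-path handling is omitted):

```cpp
#include <ndis.h>

// Sketch only: chain a second batch of already-validated NBLs onto a first
// batch and indicate them with one call. head1/tail1/count1 etc. are assumed
// to describe batches this filter built up while processing receives.
VOID IndicateCombinedBatches(
    NDIS_HANDLE filterHandle,
    PNET_BUFFER_LIST head1, PNET_BUFFER_LIST tail1, ULONG count1,
    PNET_BUFFER_LIST head2, ULONG count2,
    ULONG receiveFlags)
{
    // Chain the second batch onto the end of the first.
    NET_BUFFER_LIST_NEXT_NBL(tail1) = head2;

    // We can no longer promise that every NBL carries the same EtherType,
    // so drop the "single" flag unless that has been re-verified.
    receiveFlags &= ~NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE;

    NdisFIndicateReceiveNetBufferLists(filterHandle,
                                       head1,
                                       NDIS_DEFAULT_PORT_NUMBER,
                                       count1 + count2,
                                       receiveFlags);
}
```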
Also, the benefit of chaining packets together is much greater on the multi-packet receive path than on the multi-packet send path via FilterSendNetBufferLists. Why is that?
Receive is the expensive path, for two reasons. First, the OS has to spend CPU to demux the raw stream of packets coming in from the network. The network could send us packets from any random socket, or packets that don't match any socket at all, and the OS has to be prepared for any possibility. Second, the receive path handles untrusted data, so it has to be cautious about parsing.
In comparison, the send path is super cheap: the packets just fall down to the miniport driver, who sets up a DMA and they're blasted to hardware. Nobody in the send path really cares what's actually in the packet (the firewall already ran before NDIS saw the packets, so you don't see that cost; and if the miniport is doing any offload, that's paid on the hardware's built-in processor, so it doesn't show up on any CPU you can see in Task Manager.)
So if you take a batch of 100 packets and break it into 100 calls of 1 packet on the receive path, the OS has to grind through 100 calls of some expensive parsing functions. Meanwhile, 100 calls through the send path isn't great, but it'll be only a fraction of the CPU costs of the receive path.

What is the relationship between request content size and request duration

At the company I work, all our APIs send and expect requests/responses that follow the JSON:API standard, making the structure of the request/response content very regular.
Because of this regularity and the fact that we can have hundreds or thousands of records in one request, I think it would be fairly doable and worthwhile to start supporting compressed requests (every record would be something like < 50% of the size of its JSON:API counterpart).
To make a well informed judgement about the viability of this actually being worthwhile, I would have to know more about the relationship between request size and duration, but I cannot find any good resources on this. Anybody care to share their expertise/resources?
Bonus 1: If you were to have request performance issues, would you look at compression as a solution first, second, last?
Bonus 2: How does transmission overhead scale with size? (If I cut the size by 50%, by what percentage will the transmission overhead be cut?)
Request and response compression adds a time and CPU penalty on both the sender's side and the receiver's side. The saving is in transmission time.
How the trade-off weighs up depends a lot on the customers of the API: when they make requests, how much they request, what is requested, where they are located, the type of device/OS and its capabilities, and so on.
If the data is static -- e.g. a REST query apihost/resource/idxx returning a static resource -- there are standard web approaches, such as caching of static resources, that clients/proxies can assist with.
If the data is dynamic -- there are architectural patterns that could be used.
If the data is huge -- e.g. big scientific data sets, video, etc. -- you will almost always find it being served statically, with a metadata service providing the dynamic layer. For example, MPEG-DASH or HLS is just a collection of files.
I would choose compression as a last option relative to the other architectural options.
There are also implementation optimizations that would precede compressing requests/responses. For example:
Are your services using all the resources at their disposal (cores, memory, I/O)?
Does the architecture allow scale-up and scale-out, and can the problem be handled effectively that way (remember the client-side penalties due to compression)?
Can you use queueing, caching or other mechanisms to make things appear faster?
If you have explored all these and the answer is your system is optimal and you are looking at the most granular unit of service where data volume is an issue, by all means go after compression. Keep in mind that you need to budget compute resources for compression on the server side as well (for a fixed workload).
Your question #2 on transmission overhead vs size is a question about bandwidth and latency. Bandwidth determines how much you can push through the pipe. Latency governs the perceived response times. Whether the payload is 10 bytes or 10 MB, latency for a client across the world encountering multiple hops will be larger than for a client encountering only one or two hops, and it is bound by the round-trip time. So a solution may be to distribute the servers and place them closer to your clients around the world rather than compressing data. That is another reason why compression isn't the first thing to look at.
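To put the bonus question in numbers: a crude model of a single transfer is time ~ latency + size / bandwidth, so cutting the size by 50% only cuts the bandwidth term by 50%. A tiny sketch with invented numbers:

```cpp
#include <cstdio>

int main() {
    // Illustrative numbers only: 150 ms RTT, ~10 Mbit/s effective bandwidth.
    const double rtt_s      = 0.150;
    const double bw_bytes_s = 1.25e6;        // 10 Mbit/s ~= 1.25 MB/s

    const double sizes[] = {1.0e6, 0.5e6};   // 1 MB vs a 50%-compressed 0.5 MB
    for (double size_bytes : sizes) {
        const double t = rtt_s + size_bytes / bw_bytes_s;
        std::printf("%4.0f KB -> ~%.0f ms\n", size_bytes / 1e3, t * 1e3);
    }
    // Prints roughly: 1000 KB -> ~950 ms, 500 KB -> ~550 ms.
    // Halving the payload cut the time by ~42%, not 50%, because latency stays.
    return 0;
}
```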
Baseline your performance and benchmark your experiments for a representative user base.
I think what you are weighing here is the speed of your processor/CPU vs the speed of your network connection.
A network connection can be affected by things like distance, signal strength, DNS provider, etc.; your computer hardware is only limited by how much power you've put into it.
I'd wager that compressing your data before sending would result in shorter response times, yes, but probably by a very small amount. If you are sending JSON, the text usually isn't all that large to begin with, so you would probably only see a change in performance at the millisecond level.
If that's what you are looking for, I'd go ahead and implement it, set some timing before and after, and check your results.

Is ZeroMQ slower than boost asio?

I am trying to write a network transfer application.
The data is binary and each packet is mostly around 800 KB.
The client produces 1000 packets per second. I want to transfer the data as quickly as possible.
When I use ZeroMQ, the speed reaches 350 packets per second, while boost asio reaches 400 (or more) per second.
As you can see the performance of both methods is not good.
The pattern used for ZeroMQ is PUSH/PULL; the boost asio version uses simple synchronous I/O.
Q1: I want to ask, is ZeroMQ only suitable for small messages?
Q2: Is there a way to improve the ZeroMQ speed?
Q3: If ZeroMQ can't, please advise a good method or library for this kind of data transfer.
Data Rate
You're attempting to move 800 MByte/second (1000 x 800 KB). What sort of connection is this? For a tcp:// transport class it'd have to be something pretty rapid, e.g. at least 10 Gbit/s Ethernet, which is fairly exotic.
So I'm presuming that it's an ipc:// transport-class connection. In that case you can get an improvement by using ZeroMQ's zero-copy functions, which avoid copying the data repeatedly.
With a normal transfer, you have to copy data into a zmq message, which has to be copied into an ipc pipe, copied out again, and copied back into a new zmq message at the receiving end. All that copying requires 4 x 800 MB/s = 3.2 GByte/s of memory bandwidth which, by the time cache conflicts have come into play, is an appreciable percentage of the total memory bandwidth of a typical PC system. Using zero-copy should cut that in half.
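For the zero-copy route, libzmq's zmq_msg_init_data lets a message adopt an existing buffer instead of copying it; a rough sketch (error handling is minimal, and the buffer is assumed to be malloc-allocated so the free callback is valid):

```cpp
#include <zmq.h>
#include <cstddef>
#include <cstdlib>

// Called by ZeroMQ once it no longer needs the buffer.
static void free_buffer(void* data, void* /*hint*/) {
    std::free(data);
}

// Send one ~800 KB block without copying it into the message.
static int send_zerocopy(void* push_socket, void* block, size_t len) {
    zmq_msg_t msg;
    // The message takes ownership of `block`; no memcpy into the message.
    if (zmq_msg_init_data(&msg, block, len, free_buffer, nullptr) != 0)
        return -1;
    const int rc = zmq_msg_send(&msg, push_socket, 0);
    if (rc < 0)
        zmq_msg_close(&msg);   // still ours on failure
    return rc;                 // bytes queued, or -1 on error
}
```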
Alternative to Zero Copy - Zero Transfer
If you are using ipc://, then consider not sending data through the sockets, but sending references to the data through the sockets.
I have previously blended the use of zmq and a semaphore-locked C++ std::queue, using zmq simply for its pattern (PUSH/PULL in my case), the std::queue to carry shared pointers to the data, and leaving the data itself in place. The sender locks the queue, puts a shared pointer into it, and then sends a simple message (e.g. "1") through a zmq socket. The recipient reads the "1" and uses that as a cue to lock the queue and pull a shared pointer off it. Thus a shared pointer to the data has been transferred from one thread to another in a ZMQ pattern via the std::queue, but the data itself has stayed still. All I've done is pass ownership of the data between threads. It works so long as the shared pointer held by the sender goes out of scope immediately after sending and is not used by the sender to modify or access the data.
PUSH/PULL is not too bad to deal with - each message goes to only one recipient. It would take more effort to make such a blend with PUB/SUB, and received messages would have to be treated as read-only because each recipient would have a shared pointer to the same block of data as everyone else.
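A minimal sketch of that blend for PUSH/PULL (a mutex here instead of a semaphore, same idea; Block is a made-up payload type):

```cpp
#include <zmq.h>
#include <cstdint>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

struct Block { std::vector<uint8_t> bytes; };   // the data that stays still

std::mutex q_mutex;
std::queue<std::shared_ptr<Block>> q;           // carries ownership, not bytes

// Sender thread: park the data, then send a one-byte cue over the PUSH socket.
void send_block(void* push, std::shared_ptr<Block> blk) {
    {
        std::lock_guard<std::mutex> lock(q_mutex);
        q.push(std::move(blk));      // sender must not touch the data afterwards
    }
    zmq_send(push, "1", 1, 0);       // the zmq message itself is just a cue
}

// Receiver thread: wait for the cue, then take ownership off the queue.
std::shared_ptr<Block> recv_block(void* pull) {
    char cue;
    zmq_recv(pull, &cue, 1, 0);      // blocks until the sender has queued data
    std::lock_guard<std::mutex> lock(q_mutex);
    std::shared_ptr<Block> blk = q.front();
    q.pop();
    return blk;
}
```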
Message Size
I've no idea how big a chunk ZMTP (the ZeroMQ wire protocol) transfers at a time, but I'd guess that it's relatively efficient in terms of protocol:data ratio.

What type of framing to use in serial communication

In a serial communication link, what is the preferred message framing/sync method?
framing with SOF and escaping sequences, like in HDLC?
relying on using a header with length info and CRC?
It's an embedded system using DMA transfers of data from UART to memory.
I think the framing method with SOF is most attractive, but maybe the other one is good enough?
Does anyone have pros and cons for these two methods?
The following is based on UART serial experience, not research.
I have found fewer communication issues when the following are all included; in other words, use both SOF/EOF and a (maybe) length plus check code. Frame layout (a minimal encoder sketch follows the list):
SOFrame
(Length maybe)
Data (address, to, from, type, sequence #, opcode, bytes, etc.)
CheckCode
EOFrame
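A sketch of an encoder for such a frame (byte values are arbitrary choices for illustration, not a standard; the length field is omitted, and a 16-bit CRC is used only to keep the sketch short, whereas a larger check code is recommended later in this answer). SOF, EOF and the escape byte are reserved, and any occurrence of them in the body is escaped HDLC-style:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reserved byte values -- arbitrary choices for this sketch.
constexpr uint8_t kSOF = 0x7E, kEOF = 0x7F, kESC = 0x7D, kXOR = 0x20;

// CRC-16/CCITT-FALSE over the raw (unescaped) payload.
uint16_t crc16(const uint8_t* data, size_t len) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint16_t>(data[i]) << 8;
        for (int b = 0; b < 8; ++b)
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
    }
    return crc;
}

// Escape so the body can never be mistaken for SOF/EOF.
void push_escaped(std::vector<uint8_t>& out, uint8_t b) {
    if (b == kSOF || b == kEOF || b == kESC) {
        out.push_back(kESC);
        out.push_back(b ^ kXOR);
    } else {
        out.push_back(b);
    }
}

// Frame: SOF | escaped payload | escaped CRC16(payload) | EOF
std::vector<uint8_t> encode_frame(const std::vector<uint8_t>& payload) {
    std::vector<uint8_t> out;
    out.push_back(kSOF);
    for (uint8_t b : payload) push_escaped(out, b);
    const uint16_t crc = crc16(payload.data(), payload.size());
    push_escaped(out, static_cast<uint8_t>(crc >> 8));
    push_escaped(out, static_cast<uint8_t>(crc & 0xFF));
    out.push_back(kEOF);
    return out;
}
```

The receiver does the reverse: discard bytes until SOF, un-escape until EOF, verify the CRC, and on any error simply resynchronize on the next SOF.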
Invariably, the received "frames" include:
good ones - no issues.
Corrupt due to the sender not sending a complete message (it hung, powered down, or made a partial power-on transmission). (The receiver should time out stale incomplete messages.)
Corrupt due to noise or transmission interference. (byte framing errors, parity, incorrect data)
Corrupt due to receiver starting up in the middle of a sent message or missing a few bytes due to input buffer over-run.
Shared bus collisions.
Break - is this legit in your system?
Whatever framing you use, ensure it is robust against these message types: promptly validate #1, and rapidly identify #2-5 and become ready for the next frame.
SOF has the huge advantage of making it easy to get started again if the receiver has lost sync due to a previous bad frame, etc.
Length is good, but IMHO the least useful. It can limit throughput if the length needs to be at the beginning of a message; some low-latency operations simply do not know the length before they are ready to begin transmitting.
CRC: I recommend more than 2 bytes. A short check code does not improve things enough for me; I'd rather have no check code than a 1-byte one. If errors that only the check code will catch occur from time to time, I want something better than a 2-byte code that misses about 1 corrupted frame in 65,000; I like a 4-byte code's roughly 1 in 4 billion.
EOF is so useful!
BTW: If your protocol is ASCII (instead of binary), I recommend not using CR or LF as the EOFrame. Maybe only use them out-of-frame where they are not part of a message.
BTW2: If your receiver can auto-detect the baud rate, it saves a lot of configuration issues.
BTW3: A sender could consider sending a "nothing" byte (before the SOF) to ensure proper SOF syncing.

How is it possible to limit download speed?

I recently asked this question. But the answer doesn't suit my needs, and I know that file hosting providers do manage to limit the speed. So I'm wondering what the general algorithm/method is to do that (I do mean the downloading technique), in particular limiting the download speed of a single connection/user.
@back2dos: I want to give a particular user a particular download speed (within hardware capabilities, of course), or in other words give the user the ability to download a particular file at, let's say, 20 kB/s. Of course I want to be able to change that value.
You could use a token bucket (http://en.wikipedia.org/wiki/Token_bucket).
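For what it's worth, a minimal per-connection sketch of the idea (the class name and numbers are invented for illustration):

```cpp
#include <algorithm>
#include <chrono>

// Per-connection token bucket: `rate` bytes/second, bursts up to `burst` bytes.
class TokenBucket {
public:
    TokenBucket(double rate, double burst)
        : rate_(rate), burst_(burst), tokens_(burst),
          last_(std::chrono::steady_clock::now()) {}

    // True if `bytes` may be sent now; otherwise the caller sleeps and retries.
    bool try_consume(double bytes) {
        const auto now = std::chrono::steady_clock::now();
        const std::chrono::duration<double> dt = now - last_;
        last_ = now;
        tokens_ = std::min(burst_, tokens_ + dt.count() * rate_);  // refill
        if (tokens_ < bytes) return false;
        tokens_ -= bytes;
        return true;
    }

private:
    double rate_, burst_, tokens_;
    std::chrono::steady_clock::time_point last_;
};
```

A connection capped at roughly 20 kB/s would be constructed as TokenBucket limiter(20 * 1024, 32 * 1024), and the sender would call limiter.try_consume(n) before each n-byte write, sleeping briefly whenever it returns false.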
Without mention of platform/language, it's difficult to answer, but a "leaky bucket" algorithm would probably be the best fit:
http://en.wikipedia.org/wiki/Leaky_bucket
Well, since this answer is really general, here's a very simple approach for plain TCP:
You put the resource handlers of all download connections into a list, paired with information about what data is requested, and loop through it. For each connection you write a chunk of the required data onto the socket, maybe about 1.5 KB, which as far as I know is the most commonly used maximum segment size. When you're at the end of the list, you start over. Before starting over, simply wait as long as is needed to hit the desired average bandwidth.
Please note, if too many clients have lower bandwidth than you allow, your TCP buffers are likely to explode. Some TCP bindings permit querying the amount of currently buffered data for a socket; if it exceeds a threshold, you can simply skip that socket.
Also, if too many clients are connected, you will actually not have enough time to write to all the sockets, so after one loop you "have to wait for a negative time". Increasing the chunk size might speed things up in such scenarios, but at some point your server will stop getting faster.
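A compact sketch of that loop (the Download bookkeeping struct is hypothetical, the actual send() call is elided, and target_bps is the aggregate rate for the whole list, i.e. per-user rate times number of connections):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-download state: which socket, what is left to send.
struct Download {
    int socket_fd;
    const char* next;     // next byte of the requested data
    size_t remaining;     // bytes still to send
};

// One pass over all downloads: write one ~1.5 KB chunk each, then sleep so
// the pass as a whole matches the target bandwidth (in bytes per second).
void pacing_pass(std::vector<Download>& downloads, double target_bps,
                 size_t chunk = 1500) {
    const auto start = std::chrono::steady_clock::now();
    size_t sent = 0;
    for (Download& d : downloads) {
        const size_t n = std::min(chunk, d.remaining);
        // send(d.socket_fd, d.next, n, ...) would go here.
        d.next += n;
        d.remaining -= n;
        sent += n;
    }
    const std::chrono::duration<double> budget(sent / target_bps);
    const auto elapsed = std::chrono::steady_clock::now() - start;
    if (elapsed < budget)
        std::this_thread::sleep_for(budget - elapsed);  // "wait to hit the average"
}
```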
A simpler approach is to do this on the client side, but it may generate a lot of overhead. The dead simple idea is to have the client request 1 KB every 50 ms (assuming you want 20 KB/s). You can even do that over HTTP, although I strongly suggest a bigger chunk size, since HTTP has enormous overhead.
My guess is that the best option is to find a web server capable of doing such things out of the box. I think Apache has a number of modules for all kinds of quotas.
greetz
back2dos
