what data structure to use for - data-structures

We have a messaging system where one module sends messages to another, remote module at a high rate. The receiving module decodes each message in a specific format and forwards it to two threads: one is the logger thread and the other is the forwarder thread.
Before we send a message to these threads, we need to group the messages.
Please note that these messages arrive at a high rate, approximately 800 per second.
The alert structure is as follows:
INT type
INT Sending System ID
INT Recpt System ID
INT timestamp
INT codes
INT Source Port
INT Destination Port
Source IP Address (ipv4 or ipv6)
Destination IP Address (ipv4 or ipv6)
At the end of the match we need to maintain a structure with the following details:
struct {
    INT COUNT
    INT First Alert Timestamp
    INT Last Alert Timestamp
    INT First Alert ID
    INT Last Alert ID
}
For each alert that matches the 8 criteria, a group will be created or picked, and its count will be incremented along with the other details.
The IP Address fields can be either a structure of 5 fields (INT Address Type, INT Address1, INT Address2, INT Address3 and INT Address 4) or it can be converted to string and then stored in the structure.
We have been racking our brains for quite some time but were unable to find a structure or algorithm efficient enough to address both memory and speed.
Hence we thought of coming to you experts for help.

Use a doubly linked list to store the matched alerts. That makes it easy to retrieve the first and last alert ID. You might need to extend the doubly linked list with a count field.
Depending on your performance requirements, you could group the alerts from a list with a hash on the identifiers. If that isn't fast enough, implement a more complex tree structure that groups by the identifying fields.
The best thing I can suggest is to get it working in the simplest way possible; 800 per second is nothing. If you then have performance issues, optimize. It's so much fun writing stuff like that using test-driven development; it beats the hell out of your average CRUD code!

What do you plan on writing this in? Any suggestion is going to depend heavily on the language.
Your best bet is to start off with something like a Dictionary<string, ContainerObject> where the key consists of the needed parameters concatenated for quick lookups. Keep working with this dictionary in memory while another process logs the values appropriately to, say, a DB or flat file.
Keep it simple and 800 a second shouldn't be a problem. However, the means of communication is going to be a major factor. Is this local or remote? If it's remote and coming from a single source, your nemesis is going to be latency building up if it's done in individual requests.
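As a rough illustration of the dictionary approach above, here is a minimal Python sketch. It assumes the eight matching criteria are all alert fields except the timestamp, that IP addresses are kept as strings, and that the receiver assigns an alert ID; the field names (sender_id, src_ip, and so on) are hypothetical, not from the question.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed key: the eight non-timestamp alert fields, IPs kept as strings.
GroupKey = Tuple[int, int, int, int, int, int, str, str]


@dataclass
class Group:
    count: int
    first_ts: int
    last_ts: int
    first_id: int
    last_id: int


def group_key(alert: dict) -> GroupKey:
    """Build the lookup key from the matching fields (hypothetical names)."""
    return (
        alert["type"], alert["sender_id"], alert["recipient_id"],
        alert["codes"], alert["src_port"], alert["dst_port"],
        alert["src_ip"], alert["dst_ip"],
    )


def record_alert(groups: Dict[GroupKey, Group], alert_id: int, alert: dict) -> Group:
    """Create the group on first sight, otherwise update it in place."""
    key = group_key(alert)
    ts = alert["timestamp"]
    group = groups.get(key)
    if group is None:
        # First alert for this key: first and last details coincide.
        group = groups[key] = Group(1, ts, ts, alert_id, alert_id)
    else:
        group.count += 1
        group.last_ts = ts
        group.last_id = alert_id
    return group
```

Each alert costs one hash lookup, and memory grows with the number of distinct groups rather than the number of alerts. A tuple key also avoids the string concatenation step, though a concatenated string key would work the same way.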

Related

Grafana/Prometheus visualizing multiple ips as query

I want to have a graph where all recent IPs that requested my webserver get shown as total request count. Is something like this doable? Can I add a query and remove it afterwards via Prometheus?
Technically, yes. You will need to:
Expose some metric (probably a counter) in your server - say, requests_count, with a label; say, ip
Whenever you receive a request, inc the metric with the label set to the requester IP
In Grafana, graph the metric, likely summing it by the IP address to handle the case where you have several horizontally scaled servers handling requests: sum(your_prometheus_namespace_requests_count) by (ip)
Set the Legend of the graph in Grafana to {{ ip }} to 'name' each line after the IP address it represents
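The first two steps can be pictured with a small stand-in written in plain Python. This is not the Prometheus client library (a real server would use an official client); it only mimics a counter with an ip label and prints it in the style of the text exposition format. The names on_request and render are made up for the example.

```python
from collections import Counter

# One count per distinct label value; each key corresponds to one time series.
requests_count = Counter()


def on_request(ip: str) -> None:
    """Increment the counter for the requesting IP."""
    requests_count[ip] += 1


def render() -> str:
    """Render the counts in the style of the Prometheus text format."""
    return "\n".join(
        f'requests_count{{ip="{ip}"}} {n}'
        for ip, n in sorted(requests_count.items())
    )
```

Every new IP adds a new key, which is exactly what makes the label's cardinality matter.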
However, every different label value a metric has causes a whole new metric to exist in the Prometheus time-series database; you can think of a metric like requests_count{ip="192.168.0.1"}=1 to be somewhat similar to requests_count_ip_192_168_0_1{}=1 in terms of how it consumes memory. Each metric instance currently being held in the Prometheus TSDB head takes something on the order of 3kB to exist. What that means is that if you're handling millions of requests, you're going to be swamping Prometheus' memory with gigabytes of data just from this one metric alone. A more detailed explanation about this issue exists in this other answer: https://stackoverflow.com/a/69167162/511258
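To put the per-series figure in perspective, here is a back-of-the-envelope calculation. It takes the rough 3 kB per series quoted above as a flat 3,000 bytes and assumes one series per distinct IP; real memory use will vary with label lengths and Prometheus version.

```python
BYTES_PER_SERIES = 3_000  # rough per-series figure quoted above (~3 kB)


def head_memory_bytes(distinct_ips: int,
                      bytes_per_series: int = BYTES_PER_SERIES) -> int:
    """Estimate head memory if every distinct IP becomes its own series."""
    return distinct_ips * bytes_per_series


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} IPs -> {head_memory_bytes(n) / 1e9:.3f} GB")
```

A thousand IPs is negligible under this estimate, while a million distinct IPs already comes to about 3 GB for this one metric.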
With that in mind, this approach would make sense if you know for a fact you expect a small volume of IP addresses to connect (maybe on an internal intranet, or a client you distribute to a small number of known clients), but if you are planning to deploy to the web this would allow a very easy way for people to (unknowingly, most likely) crash your monitoring systems.
You may want to investigate an alternative -- for example, Grafana is capable of ingesting data from some common log aggregation platforms, so perhaps you can do some structured (e.g. JSON) logging, hold that in e.g. Elasticsearch, and then create a graph from the data held within that.

Create channels with extra flags in an idiomatic way

TL;DR I want to have the functionality where a channel has two extra fields that tell the producer whether it is allowed to send to the channel and if so tell the producer what value the consumer expects. Although I know how to do it with shared memory, I believe that this approach goes against Go's ideology of "Do not communicate by sharing memory; instead, share memory by communicating."
Context:
I wish to have a server S that runs (besides others) three goroutines:
Listener that just receives UDP packets and sends them to the demultiplexer.
Demultiplexer that takes network packets and based on some data sends it into one of several channels
Processing task which listens to one specific channel and processes data received on that channel.
To check whether some devices on the network are still alive, the processing task will periodically send out nonces over the network and then wait for k seconds. In those k seconds, other participants of my protocol that received the nonce will send a reply containing (besides other information) the nonce. The demultiplexer will receive the packets from the listener, parse them and send them to the processing_channel. After the k seconds elapsed, the processing task processes the messages pushed onto the processing_channel by the demultiplexer.
I want the demultiplexer to not just blindly send any response (of the correct type) it received onto the processing_channel, but to instead check whether the processing task is currently even expecting any messages and, if so, which nonce value it expects. I made this design decision in order to drop unwanted packets as soon as possible.
My approach:
In other languages, I would have a class with the following fields (in pseudocode):
class ActivatedChannel {
    boolean flag_expecting_nonce;
    int expected_nonce;
    LinkedList chan;
}
The demultiplexer would then upon receiving a packet of the correct type simply acquire the lock for the ActivatedChannel processing_channel object, check whether the flag is set and the nonce matches, and if so add the message to the LinkedList chan!
Problem:
This approach makes use of locks and shared memory, which does not align with Golang's "Do not communicate by sharing memory; instead, share memory by communicating" mantra. Hence, I would like to know... :
... whether my approach is "bad" regarding Go in the sense that it relies on shared memory.
... how to achieve the outlined result in a more Go-like way.
Yes, the approach you describe doesn't align with idiomatic Go. As you rightly point out, it communicates by sharing memory.
To achieve this idiomatically in Go, one approach is for your demultiplexer to "remember" all the processing_channels that are expecting a nonce, along with the nonce each one expects. Whenever the task behind a processing_channel is ready to receive a reply, it sends a signal to the demultiplexer saying that it is expecting a reply.
Since the demultiplexer is at the center of all the communication, it can maintain a mapping between a processing_channel and the corresponding nonce it expects. It can also maintain a "registry" of all the processing_channels that are expecting a reply.
In this approach, we are sharing memory by communicating.
For communicating that a processing_channel is expecting a reply, the following struct can be used:
type ChannelState struct {
    ChannelId        string // unique identifier for processing channel
    IsExpectingNonce bool
    ExpectedNonce    int
}
In this approach, there is no lock used.

How to get the position of a destination node?

I have been working on a position-based protocol using veins-inet and I want to get the position of the destination node.
In my code, I got the IP Address of the destination from the datagram.
const L3Address& destAddr = datagram->getDestinationAddress();
and I want to get the current position of this node.
I already checked the following question
How to get RSU coordinate from TraCIDem11p.cc?
But it seems that it refers to the node by using the node ID.
Is there a way to get the position of the node by referring to its IP Address?
I am using instant veins-4.7.1
A very simple solution would be to have each node publish its current L3Address and Coord to a lookup table whenever it moves. This lookup table could be located in a shared module or every node could have its own lookup table. Remember, you are writing C++ code, so even a simple singleton class with methods for getting/setting information is enough to coordinate this.
If, however, the process of "a node figures out where another node is" is something you would like to model (e.g., this should be a process that takes some time, can fail, causes load on the wireless channel, ...) you would first need to decide how this information would be transferred in real life, then model this using messages exchanged between nodes.

WinHttpWriteData completion

I'm using WinHTTP to transfer large files to a PHP-based web server and I want to display the progress and an estimated speed. After reading the docs I have decided to use chunked transfer encoding. The files get transferred correctly but there is an issue with estimating the time that I cannot solve.
I'm using a loop to send chunks with WinHttpWriteData (header+trailer+footer) and I compute the time difference between start and finish with GetTickCount. I have a fixed bandwidth of 4mbit configured on my router in order to test the correctness of my estimation.
The typical time difference for chunks of 256KB is between 450 - 550ms, which is correct. The problem is that once in a while (few seconds/tens of seconds) WinHttpWriteData returns really really fast, like 4-10ms, which is obviously not possible. The next difference is much higher than the average 500ms.
Why does WinHttpWriteData confirm, either synchronously or asynchronously, that it has written the data to the destination when, in reality, the data is still being transferred? Any ideas?
Oversimplified, my code looks like:
while (dataLeft)
{
    t1 = GetTickCount();
    WinHttpWriteData(hRequest, chunkHdr, chunkHdrLen, NULL);
    waitWriteConfirm();
    WinHttpWriteData(hRequest, actualData, actualDataLen, NULL);
    waitWriteConfirm();
    WinHttpWriteData(hRequest, chunkFtr, chunkFtrLen, NULL);
    waitWriteConfirm();
    t2 = GetTickCount();
    tdif = t2 - t1;
}
This is simply the nature of how sockets work in general.
Whether you call a lower-level function like send() or a higher-level function like WinHttpWriteData(), the function returns success or failure based on whether it was able to pass the data to the underlying socket layer in the kernel. The kernel queues the data for eventual transmission in the background. It does not report back when the data is actually transmitted, or whether the receiver has acknowledged it. The kernel happily accepts new data as long as there is room in the queue, even if it will take a while to actually transmit. Otherwise, it blocks the sender until room becomes available in the queue.
If you need to monitor actual transmission speed, you have to monitor the low level network activity directly, such as with a packet sniffer or driver hook. Otherwise, you can only monitor how fast you are able to pass data to the kernel (which is usually good enough for most purposes).
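If monitoring the hand-off rate is good enough, one way to keep an occasional fast return from distorting the estimate is to average over a window of several chunks instead of timing each chunk on its own. The sketch below is only a suggestion along those lines, not something WinHTTP provides; it is written in Python for brevity, and the class name WindowedRate and the window length are assumptions.

```python
from collections import deque


class WindowedRate:
    """Average bytes/second over roughly the last `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self.samples = deque()  # (completion_time_s, bytes_written)

    def add(self, now_s: float, nbytes: int) -> None:
        """Record that `nbytes` finished writing at time `now_s`."""
        self.samples.append((now_s, nbytes))
        # Drop samples older than the window, but keep at least two.
        while len(self.samples) > 2 and now_s - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def rate(self) -> float:
        """Bytes written after the oldest sample, divided by the time span."""
        if len(self.samples) < 2:
            return 0.0
        t_first, t_last = self.samples[0][0], self.samples[-1][0]
        if t_last <= t_first:
            return 0.0
        sent = sum(n for _, n in list(self.samples)[1:])
        return sent / (t_last - t_first)
```

Record a zero-byte sample at the start of the transfer, then one sample after each confirmed chunk. A chunk that returns in a few milliseconds followed by one that takes about a second then averages out to roughly the configured bandwidth.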

Google App Engine: Message class using list properties for receivers

I have a message model and I want it to have several receivers, possibly a lot of them.
I would also like to be able to tell for each receiver if the message was viewed or not (read/unread). Also I would like a receiver to be able to delete the message.
The two possible solutions are the following; for each I have a Message model and a User model.
For the first (using the ideas presented here http://www.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html)
I have a MessageReceivers class that has a ListProperty containing the users who will receive the message, with its parent set to the message. I query this with messages = db.GqlQuery('SELECT __key__ FROM MessageReceivers WHERE receivers = :1', user) and then do a db.get([ key.parent() for key in messages ]).
The problem I have with this is that I'm not sure how to store the state of the message (whether it is read or not) and, as a subsequent issue, whether the user has new messages. An additional issue would be the overhead of deleting a message (the user would have to be removed from the receivers list property).
For the second: I have a MessageReceiver for each receiver. It has links to the message and to the user, and also stores the state (read/unread).
Which of these two approaches do you think performs better? And for the first, do you have any suggestions for handling the status of the message?
I've implemented the first option in production. The drawback is that ListProperty is limited to 2500 entries if you use a custom index. Shameless plug: see my blog post http://bravenewmethod.wordpress.com/2011/03/23/developing-on-google-app-engine-for-production/
Storing read state: I did this by implementing an entity that stored unread messages up to a few months back and then just assumed older ones were read. Even simpler is to query the messages in date order, store the last known message timestamp in an entity, and assume all older messages are read. I don't recommend keeping a long history in an entity with a huge list property, because reading and storing such entities can get really slow.
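The simpler timestamp variant might look like the following. This is plain Python standing in for the datastore, not App Engine API code; the Message class and the function names are hypothetical, and timestamps are assumed to be comparable integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    key: str
    sent_at: int  # e.g. seconds since the epoch


def unread(messages: List[Message], last_read_at: int) -> List[Message]:
    """Everything sent after the stored timestamp counts as unread."""
    return [m for m in messages if m.sent_at > last_read_at]


def mark_all_read(messages: List[Message], last_read_at: int) -> int:
    """Return the new timestamp to store on the user's entity."""
    return max([last_read_at] + [m.sent_at for m in messages])
```

The trade-off is that only a single value is stored per user, so a message cannot be marked read or unread individually.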
Message deletion is expensive, no way around that.
If you need to store state per message, your best option is to write one entity per recipient, with read state (and anything else, such as flags, etcetera), rather than using the index relation pattern.

Resources