4 KB of memory
1 GB of data
8 bytes per record
Sequential data output without writing to disk.
First, you can only accumulate and output at most 4 KB per pass through the data. So you will need at least something like 250,000 passes (1 GB / 4 KB = 262,144).
How can you accumulate stuff in a pass?
In pseudocode, the idea is like this:

while not done:
    for each record (8 bytes) in dataset:
        if this record has never been output:
            if it might belong in the current batch:
                add to current batch (evicting something else if needed)
    if current_batch not empty:
        sort current batch
        emit current batch
        update "never been output" filter
    else:
        done
What does that filter look like? It needs to know three things:
What is the maximum value so far emitted?
How many times has it been emitted?
How many times has it been seen on this pass?
Any value below the maximum gets ignored. A value equal to the maximum is a duplicate: only once you've seen it more times this pass than it has already been emitted do you add another copy to the current batch.
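A minimal sketch of that filter (names and field layout are illustrative, not from the original):

```c
#include <stdint.h>

/* Tracks what has already been emitted across passes. A value below
   max_emitted is done; a value equal to it is a duplicate, and only
   copies beyond the already-emitted count still need to go out. */
typedef struct {
    uint64_t max_emitted;  /* largest value emitted so far */
    int times_emitted;     /* how many copies of it were emitted */
    int seen_this_pass;    /* copies of it seen on the current pass */
} filter_t;

/* Returns 1 if v is still a candidate for the current batch. */
int still_needed(filter_t *f, uint64_t v) {
    if (v < f->max_emitted) return 0;  /* already fully emitted */
    if (v > f->max_emitted) return 1;  /* not yet touched */
    /* v == max_emitted: skip the copies we already emitted */
    return ++f->seen_this_pass > f->times_emitted;
}
```

After a batch is emitted, you'd update `max_emitted` and `times_emitted` from the batch's last value and reset `seen_this_pass` for the next pass.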
Now, how about the current batch you're accumulating? That can be a max-heap that tells you the largest value in the batch. If the heap is not full, you add the value; if it is full and the current value is below the batch maximum, you add it and evict the current max.
When the batch is done, repeatedly remove the max: each removal frees up the last slot of the heap array (that's how array-backed heaps shrink), and you put the max there. Keep doing that and you'll have heapsorted the batch into ascending order. Now you can easily update the filter, and then emit the batch.
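Assuming ~512-entry batches (4 KB / 8 bytes per value), the batch heap and the in-place heapsort might look like this sketch:

```c
#include <stdint.h>
#include <stddef.h>

#define BATCH_CAP 512  /* ~4 KB of 8-byte values */

/* Max-heap over heap[0..n-1]: heap[0] is the batch maximum. */
static void sift_down(uint64_t *heap, size_t n, size_t i) {
    for (;;) {
        size_t largest = i, l = 2*i + 1, r = 2*i + 2;
        if (l < n && heap[l] > heap[largest]) largest = l;
        if (r < n && heap[r] > heap[largest]) largest = r;
        if (largest == i) return;
        uint64_t tmp = heap[i]; heap[i] = heap[largest]; heap[largest] = tmp;
        i = largest;
    }
}

static void sift_up(uint64_t *heap, size_t i) {
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (heap[parent] >= heap[i]) return;
        uint64_t tmp = heap[i]; heap[i] = heap[parent]; heap[parent] = tmp;
        i = parent;
    }
}

/* Offer a value to the batch; if full, it replaces the current max
   only when it is smaller (we want the smallest BATCH_CAP candidates). */
void batch_offer(uint64_t *heap, size_t *n, uint64_t v) {
    if (*n < BATCH_CAP) {
        heap[(*n)++] = v;
        sift_up(heap, *n - 1);
    } else if (v < heap[0]) {
        heap[0] = v;
        sift_down(heap, *n, 0);
    }
}

/* In-place heapsort: swap the max into the freed last slot each round,
   leaving the batch in ascending order, ready to emit. */
void batch_sort(uint64_t *heap, size_t n) {
    while (n > 1) {
        uint64_t max = heap[0];
        heap[0] = heap[n - 1];
        heap[n - 1] = max;
        sift_down(heap, --n, 0);
    }
}
```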
I don't think you can get significantly more efficient than this.
If I were asked this in an interview, I'd know the answer, but I'd also see being asked the question as a sign that the company's hiring process is suboptimal. That would make me less inclined to work there unless I could see some purpose to why they hired this way. (I know why FAANGs do it. But at most companies I'd call it a red flag.)
I'm writing a program that simulates LRU caching. The function should look in the cache to see whether the new input is inside it. If it's already there, just increment the number of hits and the frequency of the index that contains the element. If the input is not in the cache and the cache still has empty space, just push the input into the cache. If the cache has no space and the input is not inside it, find the index with the least frequency, push the input into it, and then reset that index's frequency to 0. However, I didn't get the expected result. After some analysis, I figured out that I didn't handle the case of two indexes with the same minimum frequency. I have no idea how to handle that. Do you?
I have a NoSQL database with rows of two types:
Rows that are essentially counters with a high number of updates per second. It doesn't matter if these updates are done in a batch once every n seconds (where n is, say, 2 seconds).
Rows that contain tree-like structures, where each time the row is updated the tree structure has to be updated. Updating the tree structure every time is expensive; it would be better to do it as a batch job once every n seconds.
This is my plan; afterwards I will explain the part I am struggling to execute and whether I need to move to something like RabbitMQ.
Each row has a unique id which I use as the key for Redis. Redis can easily do loads of counter increments, no problem. As for the tree structure, each update for the row can use the string APPEND command to append JSON instructions on how to modify the existing tree in the database.
This is the tricky part
I want to ensure each row gets updated every n seconds. There will be a large amount of redis keys getting updated.
This was my plan. Have three queues: pre-processing, processing, dead
By default, every key is placed in the pre-processing queue when the command for a database update comes in. After exactly n seconds, move each key/value that has been there for n seconds to the processing queue (I don't know how to do this efficiently and concurrently). Once those n seconds have passed, it doesn't matter what order the processing queue is handled in, and I can have any number of consumers racing through it. And I will have a dead queue in case tasks keep failing for some reason.
Is there a better way to do this? Is what I am thinking of possible?
Say I have an array which is initialized in the Master process (rank=0) and contains random integers.
I want to sum all the array's elements in a Slave process (rank=1) when the full array is only available to the Master process (meaning I can't just MPI_SEND the full array to the slave).
I know I can use schedule in order to divide the work between multiple threads, but I'm not sure how to do it without sending the whole array to the Slave process.
Also, I've been checking different clauses while trying to solve the problem and came across REDUCTION, I'm not sure exactly how it works.
Thanks!
What you want to do is indeed a reduction with sum as the operation. Here is how a reduction works: You have a collection of items and an operation you wish to perform that reduces them to a single item. For example, you want to sum every element in an array and end with a single number that is their sum.
To do this efficiently, you divide your collection into equal-sized chunks and distribute them to each participating process. Each process applies the operation to the elements in its chunk until it has a single value; in our running example, each process adds together its chunk of the array. Then half the processes send their results to another process, which applies the operation to the value it computed and the value it received. At this point only half the original processes are still participating. Repeat until one process has the final result.
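In real MPI code this whole pattern is what MPI_Reduce with MPI_SUM does for you. The halving scheme itself can be sketched in plain C, with each `partial[i]` standing in for the value held by process i (a power-of-two process count is assumed for simplicity):

```c
#include <stddef.h>

/* Simulates a tree reduction over nproc partial sums: in each round,
   the upper half of the still-active "processes" sends its value to a
   partner in the lower half, which adds it in. After log2(nproc)
   rounds, partial[0] holds the total. Assumes nproc is a power of two. */
long tree_reduce_sum(long *partial, size_t nproc) {
    for (size_t active = nproc; active > 1; active /= 2) {
        for (size_t i = 0; i < active / 2; i++) {
            /* process i "receives" from its partner and combines */
            partial[i] += partial[i + active / 2];
        }
    }
    return partial[0];
}
```

The inner loop runs on all processes in parallel in the real thing, which is why the reduction finishes in a logarithmic number of communication rounds rather than a linear one.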
Here is a link to a graphic that should make this a lot easier to understand: http://3.bp.blogspot.com/-ybPe3bJrpgc/UzCoG9BUFuI/AAAAAAAAB2U/Jz6UcwV_Urk/s1600/TreeStructure.JPG
Here is some MPI code for a reduction: https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_array.c
I have messages coming into my program with millisecond resolution (anywhere from zero to a couple hundred messages a millisecond).
I'd like to do some analysis. Specifically, I want to maintain multiple rolling windows of the message counts, updated as messages come in. For example,
# of messages in last second
# of messages in last minute
# of messages in last half-hour divided by # of messages in last hour
I can't just maintain a simple count like "1,017 messages in last second", since I won't know when a message is older than 1 second and therefore should no longer be in the count...
I thought of maintaining a queue of all the messages, searching for the youngest message that's older than one second, and inferring the count from the index. However, this seems like it would be too slow, and would eat up a lot of memory.
What can I do to keep track of these counts in my program so that I can efficiently get these values in real-time?
This is most easily handled by a cyclic buffer.
A cyclic buffer has a fixed number of elements and a pointer into it. When you add an element to the buffer, you increment the pointer to the next slot; if you get past the end of the fixed-length buffer, you wrap around to the beginning. It's a space- and time-efficient way to store the "last N" items.
Now, in your case you could have one cyclic buffer of 1,000 counters, each counting the number of messages during one millisecond. Adding up all 1,000 counters gives you the total count during the last second. Of course, you can optimize the reporting part by incrementally updating the count, i.e. subtract from the count the number you overwrite when you insert, and then add the new number.
You can then have another cyclic buffer that has 60 slots and counts the aggregate number of messages in whole seconds; once a second, you take the total count of the millisecond buffer and write it into the buffer with one-second resolution, and so on.
Here it is in C (initialization of the timers that call these functions is left out):

int msecbuf[1000];      /* per-millisecond counts, initialized to zero */
int secbuf[60];         /* per-second counts, ditto */
int msecptr = 0, secptr = 0;
int count = 0;          /* messages seen in the current millisecond */
int msec_total_ctr = 0; /* running total over the last 1000 ms */

void msg_received(void) { count++; }

/* Called once per millisecond. */
void every_msec(void) {
    msec_total_ctr -= msecbuf[msecptr];  /* drop the slot we overwrite */
    msecbuf[msecptr] = count;
    msec_total_ctr += count;             /* add the new slot */
    count = 0;
    msecptr = (msecptr + 1) % 1000;
}

/* Called once per second: snapshot the last-second total. */
void every_sec(void) {
    secbuf[secptr] = msec_total_ctr;
    secptr = (secptr + 1) % 60;
}
You want exponential smoothing, otherwise known as an exponentially weighted moving average. Take an EWMA of the time between message arrivals, and then divide that average interval into one second to get a rate. You can run several of these with different weights to cover effectively longer time intervals. You're then using an infinitely long window, so you don't have to worry about expiring data; the decaying weights do it for you.
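A minimal sketch of that idea (the weight `alpha` and the function names are illustrative, not from the answer):

```c
/* Exponentially weighted moving average of inter-arrival times.
   ewma_rate() returns an estimated messages-per-second figure. */
typedef struct {
    double avg_interval;  /* smoothed seconds between messages */
    double alpha;         /* weight in (0,1]; smaller = longer memory */
    int primed;           /* 0 until the first interval is recorded */
} ewma_t;

void ewma_init(ewma_t *e, double alpha) {
    e->avg_interval = 0.0;
    e->alpha = alpha;
    e->primed = 0;
}

/* Call on each message with the time since the previous one. */
void ewma_on_message(ewma_t *e, double seconds_since_last) {
    if (!e->primed) {
        e->avg_interval = seconds_since_last;
        e->primed = 1;
    } else {
        e->avg_interval = e->alpha * seconds_since_last
                        + (1.0 - e->alpha) * e->avg_interval;
    }
}

double ewma_rate(const ewma_t *e) {
    return (e->primed && e->avg_interval > 0.0) ? 1.0 / e->avg_interval : 0.0;
}
```

Several of these with different `alpha` values give you the fast- and slow-moving views; there's no buffer to expire, just two doubles per window.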
For the last millisecond, keep the count. When the millisecond slice rolls over to the next one, add the count to a millisecond rolling buffer array and reset it. If you keep this buffer cumulative, you can extract the number of messages per second with a fixed amount of memory.
When a 0.1-second slice (or some other small value next to 1 minute) is done, sum up the last 0.1*1000 items from the rolling buffer array and place that in the next rolling buffer. This way you can keep the millisecond rolling buffer small (1,000 items for a 1-second maximum lookup) and the buffer for minute lookups small as well (600 items).
You can do the same trick for whole minutes out of 0.1-minute intervals. Every window asked about can then be answered by summing (or, when using cumulative buffers, subtracting two values of) a few integers.
The only disadvantage is that the last-second value will change every millisecond, the minute value only every 0.1 second, and the hour value (and derivatives such as the percentage in the last half hour) only every 0.1 minute. But at least you keep your memory usage at bay.
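The cumulative-buffer trick above can be sketched like this (slot count and names are illustrative; the caller drives the per-slot tick):

```c
#include <stddef.h>

#define SLOTS 1000  /* e.g. one slot per millisecond */

typedef struct {
    long cum[SLOTS];  /* running-total snapshot per slot */
    size_t head;      /* most recently written slot */
    long total;       /* running total of all messages ever seen */
} cumbuf_t;

/* Call once per slot with the number of messages seen in it. */
void cumbuf_tick(cumbuf_t *b, long count_this_slot) {
    b->total += count_this_slot;
    b->head = (b->head + 1) % SLOTS;
    b->cum[b->head] = b->total;
}

/* Messages in the last w slots (w < SLOTS, and only valid once at
   least w ticks have happened): subtract two snapshots, no summing. */
long cumbuf_window(const cumbuf_t *b, size_t w) {
    size_t old = (b->head + SLOTS - w) % SLOTS;
    return b->cum[b->head] - b->cum[old];
}
```

The same struct works at every level of the hierarchy; only the slot duration changes.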
Your rolling display window can only update so fast. Let's say you want to update it 10 times a second; then for 1 second's worth of data you need 10 values. Each value contains the number of messages that showed up in that 1/10 of a second. Let's call these values bins: each bin holds 1/10 of a second's worth of data. Every 100 milliseconds, one of the bins gets discarded and a new bin is set to the number of messages that have shown up in those 100 milliseconds.
You would need an array of 36,000 bins to hold an hour's worth of information about your message rate if you wanted to preserve a precision of 1/10 of a second for the whole hour. But that seems like overkill.
I think it would be more reasonable to have the precision drop off as the time interval gets larger: maybe keep 1 second's worth of data accurate to 100 milliseconds, 1 minute's worth of data accurate to the second, 1 hour's worth of data accurate to the minute, and so on.
"I thought of maintaining a queue of all the messages, searching for the youngest message that's older than one second, and inferring the count from the index. However, this seems like it would be too slow, and would eat up a lot of memory."
A better idea would be maintaining a linked list of the messages, adding new messages to the head (with a timestamp) and popping them from the tail as they expire. Or don't even pop them: just keep a pointer to the oldest message that came in within the desired timeframe, and advance it towards the head as messages expire (this lets you track multiple timeframes with one list).
You can compute the count when needed by walking from the tail to the head, or just store the count separately, incrementing it whenever you add a value at the head and decrementing it whenever you advance the tail pointer.
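A sketch of that list with a single window pointer and a maintained count (names are illustrative; for multiple timeframes you would keep one pointer and count per window and only free nodes behind the longest window):

```c
#include <stdlib.h>

/* Singly linked queue of message timestamps: new messages are attached
   at the head (newest), and a window pointer chases expiry from the
   tail side, keeping the in-window count up to date. */
typedef struct node {
    double ts;            /* arrival time, seconds */
    struct node *newer;   /* link toward the head */
} node_t;

typedef struct {
    node_t *head;          /* newest message */
    node_t *oldest_in_win; /* oldest message still inside the window */
    long count;            /* messages currently inside the window */
    double window;         /* window length, seconds */
} winlist_t;

void win_add(winlist_t *w, double ts) {
    node_t *n = malloc(sizeof *n);
    n->ts = ts;
    n->newer = NULL;
    if (w->head) w->head->newer = n;
    w->head = n;
    if (!w->oldest_in_win) w->oldest_in_win = n;
    w->count++;
    /* advance the window pointer past expired messages */
    while (w->oldest_in_win && w->oldest_in_win->ts <= ts - w->window) {
        node_t *dead = w->oldest_in_win;
        w->oldest_in_win = dead->newer;
        free(dead);  /* only safe with a single window; otherwise keep it */
        w->count--;
    }
}
```

Querying the count is O(1); expiry cost is amortized over the messages that caused it.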