Sorting Items by a Cyclical Sequence Number

I am developing an algorithm to reorder packets in a transmission. Each packet has an associated sequence number in [0, 256). The first packet's sequence number can take on any one of those values, after which the next packet takes the next value, and the next packet the value after that, and so forth (rolling over after 255).
The sequence numbers of the packets, in the correct order, would appear as follows, where "n" is the first packet's sequence number:
n, n+1, n+2, ..., 254, 255, 0, 1, 2, ..., 254, 255, 0, 1, 2, ..., 254, 255, 0, 1, ...
Each packet is given a timestamp when it arrives at its destination, and they all arrive approximately in order. (I don't have an exact figure, but given a list of packets sorted by arrival timestamp, it is safe to say that a packet will never be more than five spots away from its position in the list indicated by its sequence number.)
I feel that I cannot have been the first person to deal with a problem like this, given the prevalence of telecommunications and its historical importance to the development of computer science. My question, then:
Is there a well-known algorithm to reorder an approximately-ordered sequence, such as the one described above, given a cyclically-changing key?
Is there a variation of this algorithm that is tolerant of large chunks of missing items? Let us assume that these chunks can be of any length. I am specifically worried about chunks of 256 or more missing items.
I have a few ideas for algorithms for the first, but not for the second. Before I invest the man-hours to verify that my algorithms are correct, however, I wanted to make sure that somebody at Bell Labs (or anywhere else) hadn't already done this better thirty years ago.

I don't know if this solution is actually used anywhere, but here is what I would try, sketched in Python (assuming no missing packets, a maximum "shuffling" of five positions, and a maximum sequence number of 255):
import heapq

def reorder(packets):
    # packets yields (seq, payload) pairs; seq is in [0, 256)
    heap = []                  # min-heap keyed by absolute (unwrapped) position
    n = None                   # absolute position of the next packet to emit
    for seq, payload in packets:
        if n is None:
            n = seq            # assume the first arrival is the true first packet
        dist = (seq - n) % 256 # cyclic distance ahead of the expected packet (0..5)
        heapq.heappush(heap, (n + dist, payload))
        # emit as soon as the expected packet is at the top of the heap
        while heap and heap[0][0] == n:
            yield heapq.heappop(heap)[1]   # do something with it
            n += 1
Basically, the priority queue stores each waiting packet keyed by its position in the stream, i.e. how many packets sit between the last handled packet and it. The queue will stay very small, because packets are handled as soon as they can be, when they come in. And since the keys are absolute positions, the heap never needs rekeying or reordering.
I am not sure how you could adapt this to missing packets. Most likely by using some timeout, or maximum offset, after which the oldest waiting packet is declared the "next" one and the heap is updated accordingly.
I do not think this problem is solvable at all, however, if you can miss more than 256 packets. Take the subsequence
127, 130, 128, 129
There could be several causes for this:
1) Packets 128 and 129 arrived out of order and should be reordered.
2) Packets 128 and 129 were lost, and then 253 further packets were lost after 130, so the received order is correct.
3) A mixture of 1 and 2.

Interesting problem!
My solution would be to sort the packets according to time of arrival and then locally sort a window of elements (say 10) circularly according to their packet number, as sketched below. You can refine this in many ways: if the difference between two consecutive packet numbers (arranged according to time of arrival) is greater than a certain threshold, you might put a barrier between them (i.e. you cannot sort across barriers). Also, if the time difference between packets (arranged according to time of arrival) is greater than some threshold, you might want to put a barrier there too (this should probably take care of problem 2).
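A minimal sketch of the windowed idea in Python, without the barriers, and assuming the first arrival really is the first packet:

def window_reorder(seqs, window=10, modulus=256):
    # Keep a small buffer of arrivals and always emit the buffered sequence
    # number that is cyclically closest to the one expected next.
    buf = seqs[:window]
    out = []
    expected = seqs[0]
    for s in seqs[window:] + [None] * window:   # trailing Nones drain the buffer
        if not buf:
            break
        best = min(buf, key=lambda q: (q - expected) % modulus)
        buf.remove(best)
        out.append(best)
        expected = (best + 1) % modulus
        if s is not None:
            buf.append(s)
    return out

print(window_reorder([254, 0, 255, 1, 2], window=3))   # [254, 255, 0, 1, 2]

The barriers would simply be cut points at which buf is drained completely before refilling.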

Use a priority queue.
After receiving each packet:
put it in the queue.
repeatedly remove the top element of the queue as long as it is the one you're waiting for.
For the second question:
In general, no, there's no way to solve it.
If the packet arrival has some periodicity (e.g. a packet expected every 20 ms), then you can easily detect a gap, clean up the queue, and after receiving 5 packets you'll know how to proceed again...
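To make that concrete, here is a sketch of the queue with such a cleanup rule bolted on; the 0.1 s timeout and the (seq, payload, arrival_time) packet format are illustrative assumptions:

import heapq

def reorder_with_timeout(packets, timeout=0.1):
    # packets yields (seq, payload, arrival_time) tuples; seq in [0, 256)
    heap, n = [], None                  # n = absolute position expected next
    for seq, payload, arrival in packets:
        if n is None:
            n = seq
        heapq.heappush(heap, (n + (seq - n) % 256, arrival, payload))
        while heap:
            key, t, p = heap[0]
            if key == n:                # the expected packet is here: emit it
                heapq.heappop(heap)
                yield p
                n += 1
            elif arrival - t > timeout: # waited too long: declare it lost
                n = key                 # skip ahead to the oldest buffered packet
            else:
                break

Packets on the far side of a very long gap remain ambiguous, though, for exactly the 256-packet reason given above.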

Get all possible valid positions of ships in battleship game

I'm creating a probability assistant for the game Battleship - in essence, for a given game state (field state and available ships), it would produce a field where every free cell is annotated with its probability of containing a ship.
My current approach is a Monte Carlo-like computation: pick a random free cell, a random ship and a random rotation, and check whether this placement is valid; if so, continue with the next ship from the available set. Once the available set is empty, record how the ships were placed. Repeat this many times and use the recorded placements to compute the probability for each cell.
Is there a sane algorithm to enumerate all possible ship placements for a given field state?
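For concreteness, the Monte Carlo sampling described above might look roughly like this in Python; the fleet composition, the per-ship retry limit, and the treatment of known hits and misses are assumptions:

import random

SIZE = 10
FLEET = [5, 4, 4, 3, 3, 3, 2, 2, 2, 2]        # assumed ship sizes

def random_layout(misses, hits):
    # Drop every ship on the board at random; return None if a sample fails.
    occupied = set()
    for length in FLEET:
        for _ in range(100):                  # bounded retries per ship
            r, c = random.randrange(SIZE), random.randrange(SIZE)
            dr, dc = random.choice([(0, 1), (1, 0)])
            cells = {(r + dr * i, c + dc * i) for i in range(length)}
            if (all(0 <= x < SIZE and 0 <= y < SIZE for x, y in cells)
                    and not cells & occupied and not cells & misses):
                occupied |= cells
                break
        else:
            return None
    return occupied if hits <= occupied else None

def heatmap(misses=frozenset(), hits=frozenset(), samples=20000):
    counts = [[0] * SIZE for _ in range(SIZE)]
    valid = 0
    for _ in range(samples):
        layout = random_layout(misses, hits)
        if layout:
            valid += 1
            for r, c in layout:
                counts[r][c] += 1
    return [[k / max(valid, 1) for k in row] for row in counts]

Note that sequential random placement does not sample all valid layouts uniformly, which is one more reason the exact enumeration below is attractive.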
An exact solution is possible, but it does not qualify as sane in my book.
Still, here is the idea.
There are many variants of the game, but let's say that we start with a worst case scenario of 1 ship of size 5, 2 of size 4, 3 of size 3 and 4 of size 2.
The "discovered state" of the board is all spots where shots have been taken, or ships have been discovered, plus the number of remaining ships. The discovered state naively requires 100 bits for the board (10x10, any can be shot) plus 1 bit for the count of remaining ships of size 5, 2 bits for the remaining ships of size 4, 2 bits for remaining ships of size 3 and 3 bits for remaining ships of size 2. This makes 108 bits, which fits in 14 bytes.
Now conceptually the idea is to figure out the map by shooting each square in turn in the first row, the second row, and so on, and recording the game state along with transitions. We can record the forward transitions and counts to find how many ways there are to get to any state.
Then find the end state of everything finished and all ships used and walk the transitions backwards to find how many ways there are to get from any state to the end state.
Now walk the data structure forward, knowing the probability of arriving at any state while on the way to the end, but this time we can figure out the probability of each way of finding a ship on each square as we go forward. Sum those and we have our probability heatmap.
Is this doable? In memory, no. In a distributed system it might be though.
Remember that I said that recording a state took 14 bytes? Adding a count to that takes another 8 bytes which takes us to 22 bytes. Adding the reverse count takes us to 30 bytes. My back of the envelope estimate is that at any point in our path there are on the order of a half-billion states we might be in with various ships left, killed ships sticking out and so on. That's 15 GB of data. Potentially for each of 100 squares. Which is 1.5 terabytes of data. Which we have to process in 3 passes.

efficient way to detect periodicity of network flows

I have lots of netflow data (i.e. src_ip, dest_ip, beg_time, end_time, data_size, etc.), and some of the flows happen periodically; those are the ones I want to find.
Consider I have n netflows (maybe around 10^6) and m of them are periodic. How can I find which ones are periodic?
I can write code for this, but it will be at least O(n^3 log n), which will take forever once there are more than about 10^4 netflows.
I have searched about it but couldn't find anything.
Note: you can assume the data is sorted by start time, and that the start time is a 32-bit unsigned int (uint32 in C++).
Correction: src_ip is unique and dest_ip is not unique; the period is unknown. It may be 5 minutes or it may be 5 days. You can forget about src_ip, dest_ip, end_time, data_size and the other attributes of a flow. I'm only looking for events whose beginning times are periodic, and you can assume I have already eliminated unrelated events (different src_ips, and so on).
Any help will be appreciated,
Thanks
I'd try computing FFT on signals corresponding to your data.
For example, I'd transform the chunk beg_time=1, end_time=5, data_size=100 into a square pulse from 1 to 5 units of time with the amplitude 100.
If you want to analyze everything together, superimpose all the pulses you've got.
If it doesn't make sense to put everything together, superimpose only the pulses from the same src_ip or from the same pair of src_ip and dst_ip.
And then run the FFT on those signals obtained through superposition and see whether there are any noticeable peaks in the frequency domain, or whether it all looks random, with no outstanding peaks.
FFT runs in O(n*log(n)) time, where n is the number of signal samples.
I'm sure there must be better ways to do it, but it may be worth a try.
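For example, a sketch with NumPy; the sampling resolution, the function name, and the flow format are assumptions:

import numpy as np

def periodicity_spectrum(flows, resolution=1.0):
    # flows: list of (beg_time, end_time, data_size) for one src/dst pair
    t_end = max(end for _, end, _ in flows)
    n = int(t_end / resolution) + 1
    signal = np.zeros(n)
    for beg, end, size in flows:              # one square pulse per flow
        signal[int(beg / resolution):int(end / resolution) + 1] += size
    signal -= signal.mean()                   # remove the DC component
    freqs = np.fft.rfftfreq(n, d=resolution)
    return freqs, np.abs(np.fft.rfft(signal))

Sharp, isolated peaks in the returned magnitudes suggest periodic flows; a flat-looking spectrum suggests there is no periodicity at that resolution.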

Algorithm to find middle of largest free time slot in period?

Say I want to schedule a collection of events in the period 00:00–00:59. I schedule them on full minutes (00:01, never 00:01:30).
I want to space them out as far apart as possible within that period, but I don't know in advance how many events I will have total within that hour. I may schedule one event today, then two more tomorrow.
I have the obvious algorithm in my head, and I can think of brute-force ways to implement it, but I'm sure someone knows a nicer way. I'd prefer Ruby or something I can translate to Ruby, but I'll take what I can get.
So the algorithm I can think of in my head:
Event 1 just ends up at 00:00.
Event 2 ends up at 00:30 because that time is the furthest from existing events.
Event 3 could end up at either 00:15 or 00:45. So perhaps I just pick the first one, 00:15.
Event 4 then ends up in 00:45.
Event 5 ends up somewhere around 00:08 (rounded up from 00:07:30).
And so on.
So we could look at each pair of taken minutes (say, 00:00–00:15, 00:15–00:30, 00:30–00:00), pick the largest range (00:30–00:00), divide it by two and round.
But I'm sure it can be done much nicer. Do share!
You can use bit reversal to schedule your events. Just take the binary representation of your event's sequential number, reverse its bits, then scale the result to the given range (0..59 minutes).
An alternative is to generate the bit-reversed words in order (0000, 1000, 0100, 1100, ...).
This allows you to distribute up to 32 events easily. If more events are needed, then after scaling the result you should check whether the resulting minute is already occupied, and if so, generate and scale the next word.
Here is the example in Ruby:
class Scheduler
  def initialize
    @word = 0
  end

  def next_slot
    bit = 32
    # Increment the 6-bit counter in bit-reversed order: toggle bits from
    # the top down until one of them flips from 0 to 1.
    while (((@word ^= bit) & bit) == 0) do
      bit >>= 1
    end
  end

  def schedule
    (@word * 60) / 64
  end
end

scheduler = Scheduler.new
20.times do
  p scheduler.schedule
  scheduler.next_slot
end
The method of generating bit-reversed words in order is borrowed from "Matters Computational", chapter 1.14.3.
Update:
Due to scaling from 0..63 to 0..59, this algorithm tends to make the smallest slots fall just after 0, 15, 30, and 45. The problem is that it always starts filling intervals from these (smallest) slots, while it is more natural to start filling from the largest slots. The algorithm is not perfect because of this. An additional problem is the need to check for an "already occupied minute".
Fortunately, a small fix removes all these problems. Just change
while (((@word ^= bit) & bit) == 0) do
to
while (((@word ^= bit) & bit) != 0) do
and initialize @word with 63 (or keep initializing it with 0, but do one iteration to get the first event). This fix decrements the reversed word from 63 to zero; it always distributes events to the largest possible slots, and produces no "conflicting" events for the first 60 iterations.
Other algorithm
The previous approach is simple, but it only guarantees that (at any moment) the largest empty slots are no more than twice as large as the smallest slots. Since you want to space events as far apart as possible, an algorithm based on Fibonacci numbers or on the golden ratio may be preferred:
Place the initial interval (0..59) into the priority queue (a max-heap, priority = interval size).
To schedule an event, pop the priority queue, split the resulting interval in the golden proportion (1.618), use the split point as the time for this event, and put the two resulting intervals back into the priority queue.
This guarantees that the largest empty slots are no more than (approximately) 1.618 times as large as the smallest slots. For the smallest slots the approximation worsens, and sizes may be related as 2:1.
If it is not convenient to keep the priority queue between schedule changes, you can prepare an array of 60 possible events in advance and extract the next value from this array every time you need a new event.
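A sketch of the priority-queue variant in Python; heapq is a min-heap, so interval sizes are negated, and the rounding to whole minutes is my own addition:

import heapq

PHI = (1 + 5 ** 0.5) / 2

def golden_events(count, period=60):
    heap = [(-float(period), 0.0, float(period))]   # (-size, start, end)
    times = []
    for _ in range(count):
        _, start, end = heapq.heappop(heap)         # largest free interval
        split = start + (end - start) / PHI         # golden-proportion split
        times.append(round(split) % period)
        heapq.heappush(heap, (-(split - start), start, split))
        heapq.heappush(heap, (-(end - split), split, end))
    return times

print(golden_events(5))   # first events land near 37, 23, 51, 14, ...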
Since you can have at most 60 events to schedule, I suppose using a static table is worth a shot (compared to devising an algorithm and testing it). I mean, for you it is quite a trivial task to lay the events out in time, but it is not so easy to tell a computer how to do it nicely.
So, what I propose is to define a table with static values of the times at which to put the next events. It could be something like:
00:00, 01:00, 00:30, 00:15, 00:45...
Since you can't reschedule events and you don't know in advance how many events will arrive, I suspect your own proposal (with Roman's note of using 01:00) is the best.
However, if you have any sort of estimate of how many events will arrive at maximum, you can probably optimize it. For example, suppose you estimate at most 7 events; then you can prepare slots of 60 / (n - 1) = 10 minutes and schedule the events like this:
00:00
01:00
00:30
00:10
00:40
00:20
00:50 // 10 minutes apart
Note that the last few events might not arrive, and so 00:50 has a low probability of being used.
This would be fairer than the non-estimation-based algorithm, especially in the worst-case scenario where all slots are used:
00:00
01:00
00:30
00:15
00:45
00:07
00:37 // Only 7 minutes apart
I wrote a Ruby implementation of my solution. It has the edge case that any events beyond 60 will all stack up at minute 0, because every free space of time is now the same size, and it prefers the first one.
I didn't specify how to handle events beyond 60, and I don't really care, but I suppose randomization or round-robin could solve that edge case if you do care.
each_cons(2) gets bigrams; the rest is probably straightforward:
class Scheduler
  def initialize
    @scheduled_minutes = []
  end

  def next_slot
    if @scheduled_minutes.empty?
      slot = 0
    else
      # Close the circle: repeat the first minute one hour later.
      circle = @scheduled_minutes + [@scheduled_minutes.first + 60]
      slot = 0
      largest_known_distance = 0
      circle.each_cons(2) do |(from, unto)|
        distance = (from - unto).abs
        if distance > largest_known_distance
          largest_known_distance = distance
          slot = (from + distance / 2) % 60
        end
      end
    end
    @scheduled_minutes << slot
    @scheduled_minutes.sort!
    slot
  end

  def schedule
    @scheduled_minutes
  end
end

scheduler = Scheduler.new
20.times do
  scheduler.next_slot
  p scheduler.schedule
end

Find the count of a particular number in an infinite stream of numbers at a particular moment

I faced this problem in a recent interview:
You have a stream of incoming numbers in the range 0 to 60000, and you have a function which will take a number from that range and return the count of occurrences of that number up to that moment. Give a suitable data structure/algorithm to implement this system.
My solution is:
Make an array of size 60001 whose entries point to bit vectors. The bit vectors will hold the counts of the incoming numbers, and each incoming number is also used to index into the array. A bit vector grows dynamically as its count gets too big to fit.
So, if the numbers come in at a rate of 100 numbers/sec, then in 1 million years the total count of numbers will be (100*3600*24)*365*1000000 ≈ 3.2*10^15. In the worst case, where every number in the stream is the same, a count takes ceil(log(3.2*10^15) / log 2) = 52 bits; if the numbers are uniformly distributed, we will have (3.2*10^15) / 60001 ≈ 5.33*10^10 occurrences of each number, which requires 36 bits per number.
So, assuming 4-byte pointers, we need (60001 * 4)/1024 ≈ 234 KB for the array. For the all-same-number case we need one bit vector of 52/8 = 6.5 bytes, which still leaves us at around 234 KB. For the uniform case we need (60001 * 36 / 8)/1024 ≈ 263.7 KB for the bit vectors, totaling about 500 KB. So it is very much feasible to do this with an ordinary PC and memory.
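In a language with arbitrary-precision integers, the growable-bit-vector bookkeeping disappears entirely; here is a minimal Python sketch of the same counting scheme:

from collections import Counter

counts = Counter()            # one arbitrary-precision counter per number

def record(n):                # call for every incoming number in [0, 60000]
    counts[n] += 1

def occurrences(n):           # the query the interviewer asked for
    return counts[n]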
But the interviewer said that since it is an infinite stream it will eventually overflow, and gave me hints like: what if there were many PCs and we could pass messages between them, or think about the file system, etc. But I kept thinking that if this solution didn't work, neither would the others. Needless to say, I did not get the job.
How do you do this problem with less memory? Can you think of an alternative approach (using a network of PCs, maybe)?
A formal model for the problem could be the following.
We want to know whether there exists a constant-space-bounded Turing machine such that, at any given time, it recognizes the language L of all pairs (number, number of occurrences so far). This means that all correct pairs will be accepted and all incorrect pairs rejected.
As a corollary of Theorem 3.13 in Hopcroft-Ullman, we know that every language recognized by a constant-space-bounded machine is regular.
Using the pumping lemma for regular languages, it can be proven that the language described above is not regular. So you can't recognize it with a constant-space-bounded machine.
You can easily use index-based counting with an array like int counts[60001]. Whenever you get a number, say 5000, directly access that index and increment it: counts[5000]++. And whenever you want to know how many times a particular number has occurred, just read counts[num] and you'll get the count for that number. It's the simplest possible constant-time-per-operation implementation.
Isn't this External Sorting? Store the infinite stream in a file. Do a seek() (RandomAccessFile.seek() in Java) in the file and get to the appropriate timestamp. This is similar to Binary Search since the data is sorted by timestamps. Once you get to the appropriate timestamp, the problem turns into counting a particular number from an infinite set of numbers. Here, instead of doing a quick sort in memory, Counting sort can be done since the range of numbers is limited.

About random number sequence generation

I am new to randomized algorithms and am learning the subject on my own by reading books. I am reading Data Structures and Algorithm Analysis by Mark Allen Weiss.
Suppose we only need to flip a coin; thus, we must generate a 0 or 1
randomly. One way to do this is to examine the system clock. The clock
might record time as an integer that counts the number of seconds
since January 1, 1970 (at least on UNIX systems). We could then use the
lowest bit. The problem is that this does not work well if a sequence
of random numbers is needed. One second is a long time, and the clock
might not change at all while the program is running. Even if the time
were recorded in units of microseconds, if the program were running by
itself the sequence of numbers that would be generated would be far
from random, since the time between calls to the generator would be
essentially identical on every program invocation. We see, then, that
what is really needed is a sequence of random numbers. These numbers
should appear independent. If a coin is flipped and heads appears,
the next coin flip should still be equally likely to come up heads or
tails.
Following are my questions on the above text snippet.
The author says that using the lowest bit of the seconds counter "does not work well", because "one second is a long time, and the clock might not change at all while the program is running". My question: why is one second a long time, given that the clock does change every second? In what context does the author mean that the clock does not change? Please help me understand with a simple example.
Also, why does the author say that even with microseconds we don't get a sequence of random numbers?
Thanks!
Programs using random (or in this case pseudo-random) numbers usually need plenty of them in a short time. That's one reason why simply using the clock doesn't really work: the system clock doesn't update as fast as your code requests new numbers, so you're quite likely to get the same result over and over again until the clock changes. It's probably more noticeable on UNIX systems, where the usual method of getting the time only gives you one-second accuracy. And even microseconds don't really help, as computers are way faster than that by now.
The second problem you want to avoid is linear dependency of pseudo-random values. Imagine you want to place a number of dots in a square, randomly. You'll pick an x and a y coordinate. If your pseudo-random values are a simple linear sequence (like what you'd obtain naïvely from a clock) you'd get a diagonal line with many points clumped together in the same place. That doesn't really work.
One of the simplest types of pseudo-random number generators, the Linear Congruential Generator, has a similar problem, even though it's not so readily apparent at first sight. Due to the very simple formula x_{n+1} = (a * x_n + c) mod m, you'll still get quite predictable results, albeit only visible if you plot tuples of outputs as points in 3D space: all the numbers lie on a small number of distinct planes (a problem all pseudo-random generators exhibit at some dimension).
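For illustration, here is a minimal LCG in Python; the constants are ones historically used by glibc's rand(), but treat the exact values as an aside:

def lcg(seed, a=1103515245, c=12345, m=2 ** 31):
    # x_{n+1} = (a * x_n + c) mod m
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(42)
print([next(gen) for _ in range(3)])  # same seed, same "random" sequence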
Computers are fast. I'm oversimplifying, but if your clock speed is measured in GHz, a computer can do billions of operations in 1 second. Relatively speaking, 1 second is an eternity, so it is possible the clock does not change during a run.
If your program is doing regular operation, it is not guaranteed to sample the clock at a random time. Therefore, you don't get a random number.
Don't forget that for a computer, a single second can be 'an eternity'. Programs / algorithms are often executed in a matter of milliseconds (thousandths of a second).
The following pseudocode:
for(int i = 0; i < 1000; i++)
n = rand(0, 1000)
fills n a thousand times with a random number between 0 and 1000. On a typical machine, this script executes almost immediately.
However, you typically only initialize the seed once, at the beginning.
The following pseudocode:
srand(time());
for(int i = 0; i < 1000; i++)
n = rand(0, 1000)
initializes the seed once and then executes the loop, generating a seemingly random set of numbers. The problem arises when you execute the program multiple times. Let's say the program finishes in 3 milliseconds and is run again within the same second: time() returns the same value, the seed is the same, and so the result is the same set of numbers.
For the second point: the author probably assumes a FAST computer. The above problem still holds...
What he means is that you cannot control how fast your computer (or any other computer) runs your code, so assuming one second between calls to the generator is far from realistic. If you run the code yourself you will see that it executes in milliseconds, so even microsecond resolution is not enough to ensure you get random numbers!
