Data structure to achieve random delete and insert where elements are weighted in [a,b] - algorithm

I would like to design a data structure and algorithm such that, given an array of elements where each element has a weight in [a,b], I can achieve constant-time insertion and deletion. Deletion is performed randomly, where the probability of an element being deleted is proportional to its weight.
I do not believe there is a deterministic algorithm that can achieve both operations in constant time, but I think there are randomized algorithms that can accomplish this?

I don't know if O(1) worst-case time is impossible; I don't see any particular reason it should be. But it's definitely possible to have a simple data structure which achieves O(1) expected time.
The idea is to store a dynamic array of pairs (or two parallel arrays), where each item is paired with its weight. Insertion is done by appending in O(1) amortised time, and an element can be removed by index by swapping it with the last element, so that it can be removed from the end of the array in O(1) time. To sample a random element from the weighted distribution, choose a uniformly random index and generate a uniform random number in the half-open interval [0, b), where b is the maximum possible weight (e.g. [0, 2) when the weights lie in [1, 2]); if it is less than that element's weight, select the element at that index, otherwise repeat the process until an element is selected. The idea is that each index is equally likely to be chosen, and the probability that it is kept rather than rejected is proportional to its weight.
This is a Las Vegas algorithm, meaning it is expected to complete in finite time, but with very low probability it can take arbitrarily long. The number of iterations required to sample an element is highest when every weight is as small as possible; for weights in [1, 2], if every weight is exactly 1 then the number of iterations follows a geometric distribution with parameter p = 1/2, so its expected value is 2, a constant which is independent of the number of elements in the data structure.
In general, if all weights are in an interval [a, b] for real numbers 0 < a <= b, then the expected number of iterations is at most b/a. This is always a constant, but it is potentially a large constant (i.e. it takes many iterations to select a single sample) if the lower bound a is small relative to b.
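For concreteness, here is a minimal Python sketch of this structure (the class name WeightedBag and its method names are mine, not from the answer); it assumes the maximum possible weight b is known up front:

import random

class WeightedBag:
    def __init__(self, max_weight):
        self.items = []          # parallel arrays: items[i] has weight weights[i]
        self.weights = []
        self.b = max_weight      # upper bound b on the weights

    def insert(self, item, weight):
        # append at the end: O(1) amortised
        self.items.append(item)
        self.weights.append(weight)

    def _remove_at(self, i):
        # swap with the last element, then pop from the end: O(1)
        self.items[i] = self.items[-1]
        self.weights[i] = self.weights[-1]
        self.items.pop()
        self.weights.pop()

    def delete_random(self):
        # rejection sampling: each index is equally likely, and it is
        # accepted with probability weight/b, i.e. proportional to its weight
        while True:
            i = random.randrange(len(self.items))
            if random.uniform(0, self.b) < self.weights[i]:
                item = self.items[i]
                self._remove_at(i)
                return item

Since each attempt succeeds with probability at least a/b, delete_random runs in O(b/a) expected time, matching the analysis above.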

This is not an answer per se, but just a tiny example to illustrate the algorithm devised by @kaya3.
| value | weight |
| v1 | 1.0 |
| v2 | 1.5 |
| v3 | 1.5 |
| v4 | 2.0 |
| v5 | 1.0 |
| total | 7.0 |
The total weight is 7.0. It's easy to maintain in O(1): store it somewhere and increase/decrease it by the weight at each insertion/removal.
The probability of each element is simply its weight divided by the total weight.
| value | proba | approx. |
| v1 | 1.0/7 | 0.1428...
| v2 | 1.5/7 | 0.2142...
| v3 | 1.5/7 | 0.2142...
| v4 | 2.0/7 | 0.2857...
| v5 | 1.0/7 | 0.1428...
Using the algorithm of @kaya3, if we draw a random index, then the probability of each value being drawn is 1/size (1/5 here).
The chance of being rejected is 50% for v1, 25% for v2 and 0% for v4. So in the first round, the probabilities of being selected are:
| value | proba | approx. |
| v1 | 2/20 | 0.10
| v2 | 3/20 | 0.15
| v3 | 3/20 | 0.15
| v4 | 4/20 | 0.20
| v5 | 2/20 | 0.10
| total | 14/20 | (70%)
Then the probability of having a 2nd round is 30%, and the probability of each index (reaching the 2nd round and being drawn there) is (6/20)/5 = 3/50:
| value | proba 2 rounds | approx. |
| v1 | 2/20 + 6/200 | 0.130
| v2 | 3/20 + 9/200 | 0.195
| v3 | 3/20 + 9/200 | 0.195
| v4 | 4/20 + 12/200 | 0.260
| v5 | 2/20 + 6/200 | 0.130
| total | 14/20 + 42/200 | (91%)
The probability of having a 3rd round is 9%, that is 9/500 for each index:
| value | proba 3 rounds | approx. |
| v1 | 2/20 + 6/200 + 18/2000 | 0.1390
| v2 | 3/20 + 9/200 + 27/2000 | 0.2085
| v3 | 3/20 + 9/200 + 27/2000 | 0.2085
| v4 | 4/20 + 12/200 + 36/2000 | 0.2780
| v5 | 2/20 + 6/200 + 18/2000 | 0.1390
| total | 14/20 + 42/200 + 126/2000 | (97.3%)
So we see that the series converges to the correct probabilities. The numerators are multiples of the weights, so it's clear that the relative weight of each element is respected.
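If you want to check this numerically, a quick Monte Carlo simulation of the rejection scheme (drawing the acceptance threshold uniformly from [0, 2), as in the example above) reproduces these probabilities; the script below is just an illustration:

import random
from collections import Counter

values  = ["v1", "v2", "v3", "v4", "v5"]
weights = [1.0, 1.5, 1.5, 2.0, 1.0]
trials  = 1_000_000

def sample():
    while True:
        i = random.randrange(len(values))          # uniform index
        if random.uniform(0, 2) < weights[i]:      # accept with probability weight/2
            return values[i]

counts = Counter(sample() for _ in range(trials))
for v, w in zip(values, weights):
    print(v, round(counts[v] / trials, 4), "expected", round(w / sum(weights), 4))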

This is a sketch of an answer.
With all weights equal to 1, we can maintain a random permutation of the inputs.
Each time an element is inserted, put it at the end of the array, then pick a random position i in the array, and swap the last element with the element at position i.
(It may well be a no-op if the random position turns out to be the last one.)
When deleting, just delete the last element.
Assuming we can use a dynamic array with O(1) (worst case or amortized) insertion and deletion, this does both insertion and deletion in O(1).
With weights 1 and 2, a similar structure may be used.
Perhaps each element of weight 2 should be put twice instead of once.
Perhaps when an element of weight 2 is deleted, its other copy should also be deleted.
So we should in fact store indices (or ids) instead of the elements themselves, plus another array, locations, which stores and tracks the one or two positions occupied by each element. The swaps should keep this locations array up-to-date.
Deleting an arbitrary element can be done in O(1) similarly to inserting: swap with the last one, delete the last one.
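A rough Python sketch of this idea, restricted to weights 1 and 2 (the names WeightedSampler12, slots and locations are mine), might look like this:

import random

class WeightedSampler12:
    def __init__(self):
        self.slots = []       # element ids; a weight-2 element appears in two slots
        self.locations = {}   # element id -> list of slot indices it occupies

    def _swap(self, i, j):
        if i == j:
            return
        a, b = self.slots[i], self.slots[j]
        self.slots[i], self.slots[j] = b, a
        self.locations[a][self.locations[a].index(i)] = j
        self.locations[b][self.locations[b].index(j)] = i

    def _place(self, elem):
        # append at the end, then swap with a random slot (keeps the array shuffled)
        self.slots.append(elem)
        self.locations.setdefault(elem, []).append(len(self.slots) - 1)
        self._swap(random.randrange(len(self.slots)), len(self.slots) - 1)

    def _remove_slot(self, i):
        # swap with the last slot, then pop from the end: O(1)
        last = len(self.slots) - 1
        self._swap(i, last)
        elem = self.slots.pop()
        self.locations[elem].remove(last)

    def insert(self, elem, weight):
        assert weight in (1, 2)
        for _ in range(weight):
            self._place(elem)

    def delete_random(self):
        # pick a uniformly random slot (equivalently, take the last slot of the
        # shuffled array, as in the sketch above); a weight-2 element is twice
        # as likely to be hit; remove every copy of it, highest indices first
        elem = self.slots[random.randrange(len(self.slots))]
        for i in sorted(self.locations[elem], reverse=True):
            self._remove_slot(i)
        del self.locations[elem]
        return elem

Sampling is uniform over slots, so the deletion probability is proportional to the weight, and both insert and delete_random touch only O(1) slots.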

Related

Proficient method to use adjacency list and matrix for graphs in golang?

I want to practice graphs in golang using an adjacency list and an adjacency matrix. For that, the user inputs the number of nodes (n) and edges (e).
In golang, I cannot dynamically initialise the size of a matrix as (var arr[e][e] int). I can only do it as (var arr[6][6]). It gives an error that the matrix size has to be constant. How can I implement this?
Another question is about the adjacency list. I have to store at each position in the array the set of nodes that this node is connected to. Example:
0 | 1 | 2 | 3 | 4 | 5 |
{} | {2,3} | {1,4} | {1,4,5} | {2,3,5} | {3,4} |
Which data structure should I use for this?
I hope I was able to convey my problem

Minimum cost within limited time for a timetable?

I have a timetable like this:
+-----------+-------------+------------+------------+------------+------------+-------+----+
| transport | trainnumber | departcity | arrivecity | departtime | arrivetime | price | id |
+-----------+-------------+------------+------------+------------+------------+-------+----+
| Q | Q00 | BJ | TJ | 13:00:00 | 15:00:00 | 10 | 1 |
| Q | Q01 | BJ | TJ | 18:00:00 | 20:00:00 | 10 | 2 |
| Q | Q02 | TJ | BJ | 16:00:00 | 18:00:00 | 10 | 3 |
| Q | Q03 | TJ | BJ | 21:00:00 | 23:00:00 | 10 | 4 |
| Q | Q04 | HA | DL | 06:00:00 | 11:00:00 | 50 | 5 |
| Q | Q05 | HA | DL | 14:00:00 | 19:00:00 | 50 | 6 |
| Q | Q06 | HA | DL | 18:00:00 | 23:00:00 | 50 | 7 |
| Q | Q07 | DL | HA | 07:00:00 | 12:00:00 | 50 | 8 |
| Q | Q08 | DL | HA | 15:00:00 | 20:00:00 | 50 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ...|
+-----------+-------------+------------+------------+------------+------------+-------+----+
In this table, there are 13 cities and 116 routes altogether, and the smallest unit of time is half an hour.
There are different transports, which doesn't matter. As you can see, there can be multiple edges with the same departcity and arrivecity but different times and different prices. The timetable is the same every day.
Now, here arises a problem.
A user wonders how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D, ... (whether they must be visited in order depends on what the user wants; that is, there are two problems), within X hours and at the least cost under the above conditions.
Before this problem, I had solved another, simpler problem:
A user wonders how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D, ... (whether they must be in order depends), at the least cost under the above conditions.
Here is how I solve it (just take not in order as an example):
1. Sort the must-pass cities: C1, C2, C3, ..., Cn. Let C0 = A, C(n+1) = B, minCost.cost = INFINITE;
2. i = 0, j = 1, W = {};
3. Find a least cost way S from Ci to Cj using Dijkstra's algorithm with price as the weight of edges. W = W ∪ S;
4. i = i + 1, j = j + 1;
5. If j <= n + 1, goto 3;
6. If W.cost < minCost.cost, minCost = W;
7. If a next permutation for C1...Cn exists, rearrange the list C1...Cn in order of the next permutation and goto 2;
8. Return minCost;
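A compact Python sketch of the procedure above might look like this (here graph is assumed to map each city to a list of (price, neighbour) pairs; the function names are mine):

import heapq
from itertools import permutations

def cheapest(graph, src, dst):
    # Dijkstra with price as the edge weight; returns the least price from src to dst
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for price, v in graph.get(u, ()):
            if d + price < dist.get(v, float("inf")):
                dist[v] = d + price
                heapq.heappush(pq, (d + price, v))
    return float("inf")

def min_cost(graph, A, B, must_pass):
    # try every permutation of the must-pass cities and sum the per-leg least costs
    best = float("inf")
    for order in permutations(must_pass):
        route = [A, *order, B]
        cost = sum(cheapest(graph, route[k], route[k + 1]) for k in range(len(route) - 1))
        best = min(best, cost)
    return best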
However, I cannot come up with an efficient solution to the first problem. Please help me, thanks.
I'd also appreciate it if anyone can solve this further problem:
A user wonders how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D, ... (whether they must be in order depends), in the least time under the above conditions.
It's quite a big problem, so I will just sketch a solution.
First, remodel your graph as follows: instead of each vertex representing a city, let a vertex represent a tuple (city, time). This is feasible because there are only 13 cities and only (time_limit - current_time) * 2 possible time slots, since the smallest unit of time is half an hour. Now connect vertices according to the given timetable, with prices as their weights, as before. Don't forget that the user can stay in any city for any amount of time for free (0-weight edges from (city, t) to (city, t + 1)). All vertices with city A are start nodes, all vertices with city B are target nodes. Take the minimum cost over all (B, time) vertices to get the least-cost solution. If there are multiple, take the one with the smallest time.
Now on towards forcing the user to pass through certain cities in order. If there are n cities to pass through (plus start and target city), you need n+2 copies of the same graph which act as different levels. The level represents how many cities of your list you have already passed. So you start in level 0 on vertex A. Once you get to C1 in level 0 you move to the vertex C1 in level 1 of the graph (connect the vertices by 0-weight edges). This means that when you are in level k, you have already passed cities C1 to Ck and you can get to the next level only by going through C(k+1). The vertices of city B in the last level are your target nodes.
Note: I said copies of the same graph, but that is not exactly true. You can't allow the user to reach C(k+2), ..., B in level k, that would violate the required order.
To enforce passing cities in any order, a different scheme of connecting the levels (and modifying them during runtime) is required. I'll leave this to you.
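To make the remodelling concrete, here is a small Python sketch of the basic version (least cost from A to B within a time limit, no must-pass cities yet). It assumes a timetable row is a tuple (departcity, arrivecity, depart_slot, arrive_slot, price), where a slot is a half-hour index, and that the whole trip happens within one day:

import heapq

def cheapest_within(rows, A, B, start_slot, last_slot):
    # vertices are (city, slot); edges are timetable rows plus free "wait" edges
    best = {(A, start_slot): 0}
    pq = [(0, A, start_slot)]
    answer = float("inf")

    def relax(city, slot, cost):
        if cost < best.get((city, slot), float("inf")):
            best[(city, slot)] = cost
            heapq.heappush(pq, (cost, city, slot))

    while pq:
        cost, city, slot = heapq.heappop(pq)
        if cost > best.get((city, slot), float("inf")):
            continue
        if city == B:
            answer = min(answer, cost)           # any (B, time) vertex is a target
        if slot < last_slot:
            relax(city, slot + 1, cost)          # waiting half an hour is free
        for dep, arr, t_dep, t_arr, price in rows:
            if dep == city and t_dep == slot and t_arr <= last_slot:
                relax(arr, t_arr, cost + price)  # board a service leaving now
    return answer

Indexing the rows by (departcity, depart_slot) would avoid rescanning the whole timetable at every vertex, and the "levels" for must-pass cities can be added by extending the vertex to (city, slot, level).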

Kernel density estimation julia

I am trying to implement a kernel density estimation. However, my code does not give the answer it should. It is written in Julia, but the code should be self-explanatory.
Here is the algorithm:
    f_hat(x) = (1 / (n·h)) · Σ_i K((x − X_i) / h)
where
    K(u) = 1/2 if |u| <= 1, and 0 otherwise.
So the algorithm tests whether the distance between x and an observation X_i, weighted by some constant factor (the binwidth h), is less than one. If so, it adds 0.5 / (n · h) to the estimate at that value, where n = number of observations.
Here is my implementation:
#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
##param data: observations for which the pdf should be estimated
##return: returns an array with the estimated densities
function kernelDensity(data)

    #Uniform kernel function.
    ##param x: Current x value
    ##param X_i: x value of observation i
    ##param width: binwidth
    ##return: Returns 1 if the absolute distance from
    #x(current) to x(observation) weighted by the binwidth
    #is less than 1. Else it returns 0.

    function uniformKernel(x, observation, width)
        u = ( x - observation ) / width
        abs( u ) <= 1 ? 1 : 0
    end

    #number of observations in the data set
    n = length(data)

    #binwidth (set arbitrarily to 0.1)
    h = 0.1

    #vector that stores the pdf
    res = zeros( Real, n )

    #counter variable for the loop
    counter = 0

    #lower and upper limit of the x axis
    start = floor(minimum(data))
    stop = ceil(maximum(data))

    #main loop
    ##linspace: divides the space from start to stop into n
    #equally spaced intervals
    for x in linspace(start, stop, n)
        counter += 1
        for observation in data

            #count all observations for which the kernel
            #returns 1 and multiply by 0.5 because the
            #kernel computed the absolute difference which can be
            #either positive or negative
            res[counter] += 0.5 * uniformKernel(x, observation, h)
        end
        #divide by n times h
        res[counter] /= n * h
    end
    #return results
    res
end
#run function
##rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))
and this is being returned:
> 0.0
> 1.5
> 2.5
> 1.0
> 1.5
> 1.0
> 0.0
> 0.5
> 0.5
> 0.0
the sum of which is 8.5 (the cumulative distribution function; it should be 1).
So there are two bugs:
The values are not properly scaled. Each number should be around one tenth of its current value. In fact, if the number of observations increases by a factor of 10^n, n = 1, 2, ..., then the cdf also increases by 10^n.
For example:
> kernelDensity(rand(1000))
> 953.53
They don't sum up to 10 (or to 1, if it were not for the scaling error). The error becomes more evident as the sample size increases: approximately 5% of the observations are not being included.
I believe that I implemented the formula 1:1, hence I really don't understand where the error is.
I'm not an expert on KDEs, so take all of this with a grain of salt, but a very similar (but much faster!) implementation of your code would be:
function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
    n = length(data)
    res = zeros(T, n)    # similar() would leave res uninitialised
    lb = minimum(data); ub = maximum(data)
    for (i, x) in enumerate(linspace(lb, ub, n))
        for obs in data
            res[i] += abs((obs - x)/h) <= 1. ? 0.5 : 0.
        end
        res[i] /= (n*h)
    end
    sum(res)
end
If I'm not mistaken, the density estimate should integrate to 1, that is, we would expect kernelDensity(rand(100), 0.1)/100 to get at least close to 1. In the implementation above I'm getting there, give or take 5%, but then again we don't know that 0.1 is the optimal bandwidth (using h = 0.135 instead I get there to within 0.1%), and the uniform kernel is known to only be about 93% "efficient".
In any case, there's a very good Kernel Density package in Julia available here, so you probably should just do Pkg.add("KernelDensity") instead of trying to code your own Epanechnikov kernel :)
To point out the mistake: you have n bins B_i of size 2h covering [0,1], so a random point X lands in an expected number of Σ_i P(X ∈ B_i) = 2nh bins. You divide by 2nh.
For n points, the expected value of your function (the sum of the returned values) is therefore about n.
Actually, you have some bins of size < 2h (for example, if start = 0, half of the first bin is outside [0,1]); factoring this in gives the bias.
Edit: Btw, the bias is easy to calculate if you assume that the bins have random locations in [0,1]. Then the bins are on average missing h/2 = 5% of their size.
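As a quick sanity check of that bias estimate: with n = 1000 and h = 0.1, it predicts a sum of roughly 1000 · (1 − 0.05) = 950, which is close to the 953.53 reported in the question.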

Intersection ranges (algorithm)

As an example, I have the following ranges:
[100,192]
[235,280]
[129,267]
As the intersection of these ranges we get:
[129,192]
[235,267]
A simple exercise for a person, but I have a problem creating an algorithm that finds that second multidimensional array…
Any language, any ideas..
In case somebody does not understand me:
I'll assume you wish to output any range that has 2 or more overlapping intervals.
So the output for [1,5], [2,4], [3,3] will be (only) [2,4].
The basic idea here is to use a sweep-line algorithm.
Split the ranges into start- and end-points.
Sort the points.
Now iterate through the points with a counter variable initialized to 0.
    If you get a start-point:
        Increase the counter.
        If the counter's value is now 2, record that point as the start-point for a range in the output.
    If you get an end-point:
        Decrease the counter.
        If the counter's value is 1, record that point as the end-point for a range in the output.
Note:
If a start-point and an end-point have the same value, you'll need to process the end-point first if the counter is 1 and the start-point first if the counter is 2 or greater, otherwise you'll end up with a 0-size range or a 0-size gap between two ranges in the output.
This should be fairly simple to do by keeping a set of elements with the following structure:
Element
int startCount
int endCount
int value
Then you combine all points with the same value into one such element, setting the counts appropriately.
Running time:
O(n log n)
Example:
Input:
[100, 192]
[235, 280]
[129, 267]
(S for start, E for end)
Points | | 100 | 129 | 192 | 235 | 267 | 280 |
Type | | Start | Start | End | Start | End | End |
Count | 0 | 1 | 2 | 1 | 2 | 1 | 0 |
Output | | | [129, | 192] | [235, | 267] | |
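Here is one possible Python implementation of the sweep-line described above (the function name and the way the tie-break is folded into per-value start/end counts are my own choices):

from collections import defaultdict

def intersections(ranges):
    starts, ends = defaultdict(int), defaultdict(int)
    for lo, hi in ranges:
        starts[lo] += 1
        ends[hi] += 1

    out, counter, open_at = [], 0, None
    for v in sorted(set(starts) | set(ends)):
        if open_at is None:
            # counter is below 2: process end-points first (the tie-break above),
            # so a coinciding start/end cannot create a zero-size output range
            if counter - ends[v] + starts[v] >= 2:
                open_at = v
            counter += starts[v] - ends[v]
        else:
            # counter is 2 or more: process start-points first,
            # so a coinciding start/end cannot create a zero-size gap
            counter += starts[v] - ends[v]
            if counter < 2:
                out.append([open_at, v])
                open_at = None
    return out

print(intersections([[100, 192], [235, 280], [129, 267]]))  # [[129, 192], [235, 267]]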
This is a Python implementation of the intersection algorithm. Its computational complexity is O(n^2).
a = [[100,192],[235,280],[129,267]]

def get_intersections(diapasons):
    intersections = []
    for d in diapasons:
        for check in diapasons:
            if d == check:
                continue
            if d[0] >= check[0] and d[0] <= check[1]:
                right = d[1]
                if check[1] < d[1]:
                    right = check[1]
                intersections.append([d[0], right])
    return intersections

print(get_intersections(a))

insert, delete, max in O(1)

Can someone tell me which data structure supports insert/delete/maximum operation in O(1)?
This is a classical interview question, and is usually presented like this:
Devise a stack-like data structure that does push, pop and min (or max) operations in O(1) time. There are no space constraints.
The answer is, you use two stacks: the main stack, and a min (or max) stack.
So for example, after pushing 1,2,3,4,5 onto the stack, your stacks would look like this:
MAIN MIN
+---+ +---+
| 5 | | 1 |
| 4 | | 1 |
| 3 | | 1 |
| 2 | | 1 |
| 1 | | 1 |
+---+ +---+
However, if you were to push 5,4,3,2,1, the stacks would look like this:
MAIN MIN
+---+ +---+
| 1 | | 1 |
| 2 | | 2 |
| 3 | | 3 |
| 4 | | 4 |
| 5 | | 5 |
+---+ +---+
For 5,2,4,3,1 you would have:
MAIN MIN
+---+ +---+
| 1 | | 1 |
| 3 | | 2 |
| 4 | | 2 |
| 2 | | 2 |
| 5 | | 5 |
+---+ +---+
and so on.
You can also save some space by pushing to the min stack only when the minimum element changes, provided the items are known to be distinct.
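For reference, the two-stack idea takes only a few lines in Python (a max variant, as a rough sketch; the class and method names are mine):

class MaxStack:
    def __init__(self):
        self.main = []
        self.maxs = []                 # maxs[i] = max of main[0..i]

    def push(self, x):
        self.main.append(x)
        self.maxs.append(x if not self.maxs else max(x, self.maxs[-1]))

    def pop(self):
        self.maxs.pop()
        return self.main.pop()

    def maximum(self):
        return self.maxs[-1]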
The following solution uses O(1) extra memory and O(1) time for max, push and pop operations.
Keep a variable max which will keep track of the current max element at any particular time.
Let's use the fact that when max is updated, all the elements already in the stack are less than the new max element.
When a push operation occurs and the new element (newElement) is greater than the current max, we push max + newElement onto the stack and update max = newElement.
When we do a pop operation and find that the popped value is greater than the current max, we know that this is a place where we had pushed max + elem. So the actual element to be returned is the current max, and we then update max = poppedValue - max.
For example, if we push 1, 2, 3, 4, 5, the stack and the max variable will look like this:
MAIN Value of MAX
+---+ +---+
| 9 | max = | 5 |
| 7 | max = | 4 |
| 5 | max = | 3 |
| 3 | max = | 2 |
| 1 | max = | 1 |
+---+ +---+
Now let's say we pop an element: since top > max, we return the current max as the popped element and update max to (top - max):
MAIN Value of MAX
+---+ +---+
| 7 | max = | 4 | = (9-5)
| 5 | max = | 3 |
| 3 | max = | 2 |
| 1 | max = | 1 |
+---+ +---+
Now let's say we push the numbers 5, 4, 3, 2, 1 (starting from an empty stack); the stack will look like:
MAIN Value of MAX
+---+ +---+
| 1 | max = | 5 |
| 2 | max = | 5 |
| 3 | max = | 5 |
| 4 | max = | 5 |
| 5 | max = | 5 |
+---+ +---+
When we pop, the top of stack is popped since top < max, and max remains unchanged.
Following is pseudocode for each of the operations, for better insight.
Elem max;   // current maximum; assume it is initialised from the first pushed element

void Push(Elem x) {
    if (x < max)
        push(x);
    else {
        push(x + max);   // encoded value: always greater than the new max
        max = x;
    }
}

Elem Pop() {
    Elem p = pop();
    if (|p| < |max|)
        return p;          // a plainly stored element
    else {
        Elem top = max;    // the element pushed here is the current max
        max = p - max;     // restore the previous max
        return top;
    }
}

Elem Max() {
    return max;
}
push and pop are normal stack operations. Hope this helps.
@KennyTM's comment points out an important missing detail - insert where, and delete from where. So I am going to assume that you always want to insert and delete only from one end, like a stack.
Insertion (push) and Delete (pop) are O(1).
To get Max in O(1), use an additional stack to record the current max which corresponds to the main stack.
If you are using only comparisons, you would be hard pressed to find such a data structure.
For instance, you could insert n elements, then repeatedly get the max and delete it, and thus sort the numbers in O(n) time, while the theoretical lower bound is Omega(n log n).
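To make the reduction concrete, here is a tiny Python sketch (structure_cls stands for a hypothetical data structure claiming O(1) insert, max and delete_max; the names are made up):

def sort_via_max(structure_cls, xs):
    # if insert, max and delete_max were all O(1), this would sort n numbers
    # in O(n) comparisons, contradicting the Omega(n log n) lower bound
    s = structure_cls()
    for x in xs:
        s.insert(x)
    out = []
    for _ in xs:
        out.append(s.max())
        s.delete_max()
    return out[::-1]   # collected from largest to smallest, so reverse for ascending order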
The program below keeps track of the max elements in a parallel array, in such a way that at any point in time the top of that array gives us the max of the stack.
So max is O(1), and we can find it as max[N].
ITEM MAX
+---+ +---+
| 1 | | 1 |
| 10| | 10|
| 9 | | 10|
| 19| | 19| <--top
+---+ +---+
Java Program:
public class StackWithMax {
    private int[] item;
    private int N = 0;
    private int[] max;   // max[k] = maximum of the first k pushed items (assumes positive values)

    public StackWithMax(int capacity) {
        item = new int[capacity]; // generic array creation not allowed
        max = new int[capacity + 1]; // one extra slot, since max[N] is read after N pushes
    }

    public void push(int item) {
        this.item[N++] = item;
        if (max[N - 1] > item) {  // carry the previous max forward...
            max[N] = max[N - 1];
        } else {                  // ...or record the new max
            max[N] = item;
        }
    }

    public void pop() {
        this.item[N - 1] = 0;     // clear the old top (the original cleared item[N], an off-by-one)
        this.max[N] = 0;
        N--;
    }

    public int findMax() {
        return this.max[N];
    }

    public static void main(String[] args) {
        StackWithMax max = new StackWithMax(10);
        max.push(1);
        max.push(10);
        max.push(9);
        max.push(19);
        System.out.println(max.findMax()); // 19
        max.pop();
        System.out.println(max.findMax()); // 10
    }
}
Like some have already pointed out, the question lacks some information. You don't specify where to insert/delete, nor the nature of the data we are dealing with.
Some ideas that could be useful: You say,
insert/delete/maximum operation in O(1)
note that if we can insert, delete, and find the maximum in O(1), then we can use this hypothetical technique to sort in O(n): insert the n elements, then repeatedly take the max and delete it, and we get them all in sorted order. It's proven that no sorting algorithm based on comparisons can sort in less than O(n log n), so we know that no comparison-based approach will work. In fact, one of the fastest known structures of this kind is the Brodal queue, but its deletion time exceeds O(1).
Maybe the solution is something like a radix tree, where the complexity of all these operations is related to the key length as opposed to the number of keys. This is valid only if you are allowed to bound the key length by some other number, so you can consider it constant.
But maybe it wasn't something that generic. Another interpretation is that the insert/delete are those of a classic stack. In that restricted case, you can use the double-stack solution that Can Berk Güder gave you.
The best that exists is:
Insert in O(1)
Delete in O(log n)
Max/Min in O(1)
But to do that, the insert function must create a link chain, and you will also need an extra thread. The good news is that this link-chain function also works in O(1), so it does not change the O(1) of insert.
The delete function doesn't break the link chain.
If the target of your delete is the max or the min, then the delete will be executed in O(1).
The data structure is a mix of an AVL tree and a linked list.
The nature of a true delete is such that you cannot make it work in O(1). Hash tables, which do offer O(1) delete, don't have the capability to hold all the inputs.
A hash table might support insert/delete in O(1), no clue about maximum though. You'd probably need to keep track of it yourself somehow.
