This question is a follow-up to the previous SO question:
Distance Calculation for massive number of devices/nodes
I have N mobile devices/nodes (say 100K) and periodically obtain their location (latitude, longitude) values.
Some of the devices are "logically connected" to roughly M other devices (say 10 on average). My program periodically compares the distance between each device and its logically connected devices and determines whether the distance is within a threshold (say 100 meters).
Furthermore, the number of logical connection types "K" can also be more than one (say 5 on average).
For example, A can be connected to B and C under a "parents" logic, and A can also be connected to C, D, E, F under a "work" logic.
I need a robust algorithm to calculate these distances to the logically connected devices.
The complexity of the brute-force approach would be N*M*K (i.e. cubic in terms of order).
The program does this every 3 seconds (all devices are mobile), so for instance 100K*10*5 = 5M distance calculations every 3 seconds, which is not good.
Are there any good/classical algorithms for this operation?
I decided to rewrite my answer after a bit more thought.
The complexity of your problem is not O(N^3) in the worst case, it is actually only O(N^2) in the worst case. It's also not O(N*M*K) but rather O(N*(M+K)), where O(M+K) is O(N). However, the real complexity of your problem is O(E) where E is the total number of logical connections (number of work connections + number of parent connections). Unless you want to approximate, your solution cannot be better than O(E). Your averages suggest that you likely have on the order of 5 million connections, which is on the order of O(N log N).
Your example uses two sets of logical connections, so you would simply cycle through each set and check whether the distance between the devices of each logical connection is within the threshold.
That being said, the example you gave and your assumed time complexity suggest you are interested in more than whether the individual connections are within the threshold, but rather whether whole sets of connections are within the threshold. Specifically, in your example it would return True only if the parents-logic pairs (A,B), (A,C) and the work-logic pairs (A,C), (A,D), (A,E), (A,F) are all True. In that case your best data structure would be a dictionary of dictionaries that looks like the following in Python (this includes the optimization below):
"parentsLogic[A][B] = (last position A, last position B, was within threshold)".
If it's common that the positions don't change much, you may obtain some run-time improvement by storing the previous positions along with whether they were within the threshold (a Boolean). The benefit is that you can simply return the previous result if the two positions haven't changed, and update them only if they have.
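For illustration, a minimal Python sketch of that dictionary-of-dictionaries cache might look like the following (the function names and the haversine-style distance check are my own choices, not something from the question):

    import math

    def within_threshold(p1, p2, threshold_m=100.0):
        """Haversine-style check of whether two (lat, lon) points are within threshold_m metres."""
        lat1, lon1 = map(math.radians, p1)
        lat2, lon2 = map(math.radians, p2)
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
        return 2 * 6371000 * math.asin(math.sqrt(a)) <= threshold_m

    # parentsLogic[A][B] = (last position A, last position B, was within threshold)
    parentsLogic = {}

    def check_connection(logic, a, b, pos_a, pos_b):
        cached = logic.get(a, {}).get(b)
        if cached is not None and cached[0] == pos_a and cached[1] == pos_b:
            return cached[2]                 # neither endpoint moved: reuse the old answer
        result = within_threshold(pos_a, pos_b)
        logic.setdefault(a, {})[b] = (pos_a, pos_b, result)
        return result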
You can use a brute-force algorithm, sort the results, and then use the top groups.
One thing you can do in addition to what was suggested in the answers to the previous question is to store a list of the nearby connected devices for every device and update it only for those devices that have moved by a significant distance since the last update (and for the devices connected to those that have moved).
For example, if the threshold is 100 m, store a list of the connected devices within 200 m of every device, and update it for every device that has moved by more than 50 m since the last update.
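A rough sketch of that idea, assuming planar (x, y) positions in metres and made-up container names:

    import math

    THRESHOLD = 100.0              # metres
    CACHE_RADIUS = 2 * THRESHOLD   # keep connected devices within this radius in the cached list
    MOVE_TRIGGER = THRESHOLD / 2   # rebuild only after moving this far

    last_built_at = {}             # device -> position used when its list was last rebuilt
    nearby = {}                    # device -> set of connected devices within CACHE_RADIUS

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def rebuild(device, connections, positions):
        nearby[device] = {d for d in connections[device]
                          if dist(positions[device], positions[d]) <= CACHE_RADIUS}
        last_built_at[device] = positions[device]

    def on_position_update(device, connections, positions):
        prev = last_built_at.get(device)
        if prev is None or dist(prev, positions[device]) > MOVE_TRIGGER:
            rebuild(device, connections, positions)
            for other in connections[device]:      # neighbours' cached lists may be stale too
                rebuild(other, connections, positions)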
Grid Illumination: Given an NxN grid with an array of lamp coordinates. Each lamp illuminates every square in its row, every square in its column, and every square on its diagonals (think of a Queen in chess). Given an array of query coordinates, determine whether each point is illuminated or not. The catch is that when checking a query, all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9.
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity but it's not that fast, exponential in fact. Where should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both the nw-se and ne-sw directions. The two diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per cell toward, say, the northeast and decrease (going negative) toward the southwest. Similarly, the ne-sw diagonals are numbered increasing toward, say, the northwest and decreasing (negative) toward the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If a list has 3 or fewer entries, it's a constant-time operation to see whether they're all adjacent.
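As a sketch of how the diagonal numbers, the four maps, and the query check could fit together (the names are mine, the lamp coordinates are assumed distinct, and this does not model lamps staying switched off across later queries, if the problem requires that):

    from collections import defaultdict

    def nw_se_diag(x, y, x0, y0):
        return (x - x0) - (y - y0)     # constant along one diagonal direction

    def ne_sw_diag(x, y, x0, y0):
        return (x - x0) + (y - y0)     # constant along the other diagonal direction

    def build_maps(lamps):
        x0, y0 = lamps[0]              # first lamp acts as the "origin"
        xMap, yMap = defaultdict(list), defaultdict(list)
        nwSeMap, swNeMap = defaultdict(list), defaultdict(list)
        for (x, y) in lamps:
            xMap[x].append((x, y))
            yMap[y].append((x, y))
            nwSeMap[nw_se_diag(x, y, x0, y0)].append((x, y))
            swNeMap[ne_sw_diag(x, y, x0, y0)].append((x, y))
        return (x0, y0), (xMap, yMap, nwSeMap, swNeMap)

    def is_lit(query, origin, maps):
        x, y = query
        x0, y0 = origin
        keys = (x, y, nw_se_diag(x, y, x0, y0), ne_sw_diag(x, y, x0, y0))
        for table, key in zip(maps, keys):
            lst = table.get(key, [])
            if len(lst) > 3:
                return True            # removing at most 3 adjacent lamps can't empty the list
            if any(abs(lx - x) > 1 or abs(ly - y) > 1 for lx, ly in lst):
                return True            # a non-adjacent lamp on this line survives the removal
        return False

Building the maps is one linear pass over the lamps; each query is then four dictionary lookups plus at most a dozen adjacency checks.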
This solution requires the input points to be represented in 4 lists. Since they already need to be represented in at least one list, you can argue that this algorithm requires only a constant factor of extra space with respect to the input. (I.e. the same sort of cost as mergesort.)
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in a language with unboxed structures like C. So a machine with ~32 GB of RAM ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well under a minute.
There is a very easy answer that works:
Create a grid of NxN.
For each lamp, increment the count of every cell that the lamp illuminates.
For each query, check whether the cell at that query has a value > 0.
Then, for each lamp on or adjacent to the query, find all the cells it illuminates and reduce their counts by 1.
This worked fine but failed the size limit when I tried a 10000 x 10000 grid.
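For what it's worth, a minimal version of that counting approach might look like this (made-up names, distinct lamp coordinates assumed; as noted, the grid alone is O(N^2) memory, which is what breaks at 10000 x 10000):

    def apply_lamp(grid, lx, ly, delta):
        """Add delta to every cell illuminated by the lamp at (lx, ly)."""
        n = len(grid)
        for i in range(n):
            grid[lx][i] += delta                       # its row
            grid[i][ly] += delta                       # its column
            d = i - lx
            if 0 <= ly + d < n:
                grid[i][ly + d] += delta               # one diagonal
            if 0 <= ly - d < n:
                grid[i][ly - d] += delta               # the other diagonal

    def answer_query(grid, lamps_on, qx, qy):
        lit = grid[qx][qy] > 0
        # turn off lamps on or adjacent to the query, undoing exactly their contributions
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                cell = (qx + dx, qy + dy)
                if cell in lamps_on:
                    apply_lamp(grid, cell[0], cell[1], -1)
                    lamps_on.remove(cell)
        return lit

    # setup: grid = [[0] * n for _ in range(n)]; lamps_on = set(lamps)
    # then call apply_lamp(grid, lx, ly, +1) for each lamp before answering queries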
For the sake of security, I probably can't post any of our files' code, but I can describe what's going on. Basically, we have standalone items and others that are composed of smaller parts. The current system we have in place works like this. Assume we have n items and m parts for each of the kits, where m is not constant and less than n in all cases.
for(all items){
if(standalone){
process item, record available quantity and associated costs
write to database
}
if(kit){
process item, get number of pre-assembled kits
for(each part){
determine how many are used to produce one kit
divide total number of this specific part by number required, keep track of smallest result
add cost of this item to total production cost of item
}
use smallest resulting number to determine total available quantity for this kit
write record to database
}
}
At first, I wanted to say that the total time taken for this is O(n^2), but I'm not convinced that's correct given that about n/3 of all items are kits and m generally ranges between 3 and 8 parts. What would this come out to? I've tested it a few times and it feels like it's not optimized.
From the pseudo-code that you have posted it is fairly easy to work out the cost. You have a loop over n items (thus this is O(n)), and inside this loop another loop of O(m). As you worked out, nested loops mean that the orders are multiplied: if they were both of order n then this would give O(n^2); instead it is O(mn).
This has assumed that the processing that you have mentioned runs in constant time (i.e. is independent of the size of the inputs). If those descriptions hide some other processing time then this analysis will be incorrect.
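A back-of-the-envelope check with numbers similar to the question's (about a third of the items are kits, 3 to 8 parts each — my made-up figures) shows why this stays far from n^2:

    # toy numbers, only to illustrate the O(mn) bound when m is small and bounded
    n = 90_000              # items
    kits = n // 3           # roughly a third are kits
    avg_parts = 5.5         # m ranges from 3 to 8
    total_inner = kits * avg_parts
    print(n + total_inner)  # ~255,000 steps -- roughly linear in n, nowhere near n^2 = 8.1e9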
I have created the following mathematical abstraction for an optimisation problem.
O - Order
C - Customer
P - Producer
S - Production slot (related to a producer)
pc - Production cost per slot
tc - Transport cost for an order produced in a specific slot
a - Product amount per order
ca - Slot capacity
e - End time of a slot
dt - Latest order arrival time
tt - Transport time
Decision variables:
x - Produce in a specific slot
y - Produce an order in a specific slot
The following cost function has to be minimised:
Meaning each order has to be processed in exactly one production slot. Depending on the producer connected to this slot and its distance to the customer, transport costs arise. If at least one order is produced in a slot, some sort of fixed cost for this production occurs. The aggregated sum of these costs should be minimised.
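The formula itself didn't make it into this post, but from the symbol definitions above it presumably has roughly this shape (my reconstruction, so treat the exact indexing as a guess):

    \min \; \sum_{s \in S} pc_s \, x_s \;+\; \sum_{o \in O} \sum_{s \in S} tc_{o,s} \, y_{o,s}
    \quad \text{with } x_s \in \{0,1\}, \; y_{o,s} \in \{0,1\}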
The following conditions apply:
This optimisation has to be repeated every day with a different set of orders (o and a will vary).
Each producer has 4 to 5 slots to produce in. There are 20 producers and roughly 100 orders.
The last condition can be used to reduce the problem in advance: set all y's to 0 where the required arrival time cannot be met, i.e. where the latest arrival time is before the production end time plus the transport time.
As far as I can see, a problem of this size cannot easily be solved by just iterating through all feasible solutions. Even if I use condition 2 to strongly reduce the combinations of "y", I will be left with approximately (5*20)^200 solutions to check (assign each order to each slot and check all combinations of order assignments with respect to fulfilment of the remaining conditions and the cost function value).
With some mathematical transformations I can make the problem look pretty close to a Multidimensional Knapsack problem.
E.g.
- Multiply the objective function by -1 to get a maximisation problem.
- Split condition 2 into two conditions to get less-than-or-equal conditions (multiply the greater-than one by -1 as well).
- Move the capacity term in condition 3 to the left side.
But these operations lead to partly negative coefficients. Algorithms found in the literature for the MDK often assume them to be positive; is that a problem?
Does someone have a good branch-and-bound approach to tackle the problem? Or do I have to use a metaheuristic, as often proposed in the literature? Has someone used a specific approach to solve a similar problem before and can tell me about his/her experience?
Does someone have other modelling tricks to reduce the problem complexity in this situation?
I came across the following problem of distributing load over a number of machines in a network. The problem is an interview question. The candidate should specify which algorithm and which data structures are best for this problem.
We have N machines in a network. Each machine can accept up to 5 units of load. The requested algorithm receives as input a list of machines with their current load (ranging from 0 to 5), the distance matrix between the machines, and the new load M that we want to distribute on the network.
The algorithm returns the list of machines that can service the M units of load and have the minimum collective distance. The collective distance is the sum of the distances between the machines in the resulting list.
For example, if the resulting list contains three machines A, B and C, these machines can collectively service the M units of load (if M=5, A can service 3, B can service 1, C can service 1) and the sum of distances SUM = AB + BC is the smallest that can collectively service the M units of load.
Do you have any proposals on how to approach it?
The simplest approach I can think of is defining a value for every machine, something like the summation of inverted distances between this machine and all its adjacent machines:
v_i = sum(1/dist(i, j) for j in A_i)
(Sorry I couldn't manage to put math formula here)
You can invert the summation again and call it the machine's crowd value (or something like that), but you don't need to.
Then sort the machines based on this value (descending if you have inverted the summation value).
Start with the machine with the minimum value (maximum crowd) and add as much load as you can. Then go to the next machine and do the same until you have assigned all of the load you want.
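A quick sketch of that greedy pass (using the un-inverted sum, so a larger value means a more "crowded" machine; `load` mapping machines to their current load and `dist` as a distance matrix are hypothetical structures):

    def crowd_value(i, dist):
        return sum(1.0 / dist[i][j] for j in range(len(dist)) if j != i and dist[i][j] > 0)

    def assign_load(load, dist, m_units, capacity=5):
        order = sorted(load, key=lambda i: -crowd_value(i, dist))   # most crowded first
        assignment = {}
        for i in order:
            if m_units == 0:
                break
            take = min(capacity - load[i], m_units)
            if take > 0:
                assignment[i] = take
                m_units -= take
        return assignment        # machine -> extra units placed on it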
It sounds like every machine is able to process the same amount of load -- namely 5 units. And the cost measure you state depends only on the set of machines that have nonzero load (i.e. adding more load to a machine that already has nonzero load will not increase the cost). Therefore the problem can be decomposed:
1. Find the smallest number k <= n of machines that can perform all jobs. For this step we can ignore the individual identities of the machines and how they are connected.
2. Once you know the minimum number k of machines necessary, decide which k of the n machines offers the lowest cost.
(1) is a straightforward Bin packing problem. Although this problem is NP-hard, excellent heuristics exist and nearly all instances can be quickly solved to optimality in practice.
There may be linear algebra methods for solving (2) more quickly (if anyone knows of one, feel free to edit or suggest in the comments), but without thinking too hard about it you can always just use branch and bound. This potentially takes time exponential in n, but should be OK if n is low enough, or if you can get a decent heuristic solution that bounds out most of the search space.
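For small n, even a plain exhaustive search over k-machine subsets works as a baseline before investing in branch and bound (the pairwise-sum cost below is just one reading of the "collective distance"; swap in whatever the real cost is):

    from itertools import combinations

    def cheapest_k_subset(free_capacity, dist, k, m_units):
        """free_capacity[i] = spare units on machine i; dist[i][j] = distance matrix."""
        machines = range(len(free_capacity))
        best, best_cost = None, float('inf')
        for subset in combinations(machines, k):
            if sum(free_capacity[i] for i in subset) < m_units:
                continue                                    # this subset cannot serve the load
            cost = sum(dist[a][b] for a, b in combinations(subset, 2))
            if cost < best_cost:
                best, best_cost = subset, cost
        return best, best_cost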
(I did try thinking of a DP in which we calculate f(i, j), the lowest cost way of choosing i machines from among machines 1, ..., j, but this runs into the problem that when we try adding the jth machine to f(i - 1, j - 1), the total cost of the edges from the new machine to all existing machines depends on exactly which machines are in the solution for f(i - 1, j - 1), and not just on the cost of this solution, thus violating optimal substructure.)
I have a large table of N items with M (M>=3) distinct properties per item.
From this table I have to remove all items for which the same table contains an item that scores equal or better on all properties.
I have an algorithm (python) that solves it already, but it is output-sensitive and has a worst case of approx. O((n²+n)/2) when no items are removed in the process.
This is far too slow for my project (where datasets of 100,000 items with 8 properties per item are not uncommon), so I require something close to O(m*n log n) worst case, but I do not know whether this problem can be solved that fast.
Example problem case and its solution:
[higher value = better]
    Singing  Dancing  Acting
A      10       20       10
B      10       20       30
C      30       20       10
D      30       10       30
E      10       30       20
F      30       10       20
G      20       30       10
Dismiss all candidates for which there is a candidate that performs equal or
better in all disciplines.
Solution:
- A is dismissed because B,C,E,G perform equal or better in all disciplines.
- F is dismissed because D performs equal or better in all disciplines.
Does there exist an algorithm that solves this problem efficiently, and what is it?
A general answer is to arrange the records into a tree and keep notes at each node of the maximum value in each column for the records lying beneath that node. Then, for each record, chase it down the tree from the top until you know whether it is dominated or not, using the notes at each node to skip entire subtrees, if possible. (Unfortunately you may have to search both descendants of a node). When you remove a record as dominated you may be able to update the annotations in the nodes above it - since this need not involve rebalancing the tree it should be cheap. You might hope to at least gain a speedup over the original code. If my memory of multi-dimensional search is correct, you could hope to go from N^2 to N^(2-f) where f becomes small as the number of dimensions increases.
One way to create such a tree is to repeatedly split the groups of records at the median of one dimension, cycling through the dimensions with each tree level. If you use a quicksort-like median search for each such split you might expect the tree construction to take you time n log n. (kd-tree)
One tuning factor on this is not to split all the way down, but to stop splitting when the group size falls below some small threshold.
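To make the idea concrete, here is a small Python sketch along those lines (my own naming; ties between items with identical property vectors would need extra care):

    def build_tree(points, depth=0, leaf_size=16):
        """kd-tree-style node: split on one property per level, store per-node column maxima."""
        dims = len(points[0])
        node = {'maxima': [max(p[d] for p in points) for d in range(dims)],
                'points': None, 'left': None, 'right': None}
        if len(points) <= leaf_size:
            node['points'] = points
            return node
        axis = depth % dims
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        node['left'] = build_tree(points[:mid], depth + 1, leaf_size)
        node['right'] = build_tree(points[mid:], depth + 1, leaf_size)
        return node

    def dominated(node, p):
        """True if some point in the tree beats or equals p on every property (and differs from p)."""
        if any(m < v for m, v in zip(node['maxima'], p)):
            return False                    # even the per-column maxima can't dominate p: prune
        if node['points'] is not None:      # leaf
            return any(q != p and all(a >= b for a, b in zip(q, p)) for q in node['points'])
        return dominated(node['left'], p) or dominated(node['right'], p)

    items = {'A': (10, 20, 10), 'B': (10, 20, 30), 'C': (30, 20, 10),
             'D': (30, 10, 30), 'E': (10, 30, 20), 'F': (30, 10, 20),
             'G': (20, 30, 10)}
    tree = build_tree(list(items.values()))
    print([k for k, v in items.items() if not dominated(tree, v)])   # B, C, D, E, G

On a dataset this tiny the pruning never kicks in and it behaves like the brute-force check; the per-node maxima only start paying off once the tree is deep enough that whole subtrees get skipped.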
It looks like this paper, http://flame.cs.dal.ca/~acosgaya/Research/skyline/on%20finding%20the%20maxima%20of%20a%20set%20of%20a%20vectors.pdf addresses your problem.
Above link is broken. Here is another one:
http://www.eecs.harvard.edu/~htk/publication/1975-jacm-kung-luccio-preparata.pdf
What you have here is a partially ordered set, where A <= B if all traits of A have values less than or equal to those of B, and A >= B if all traits of A have values greater than or equal to those of B. It is possible that !(A<=B || A>=B); in this case A and B are "incomparable". Your problem is to eliminate from the set those elements which are dominated by other elements, i.e. remove every A such that there exists B in the set with A < B.
In the worst case all the elements are incomparable, i.e. you can't eliminate anything. Now let's look at the incomparability relationship. Suppose A !~ B (incomparable) and B !~ C. Is it possible that A and C are still comparable? Yes! For example, A could have traits {1,2,3}, B {2,1,5} and C {2,3,4}. This means that incomparability is not "transitive", and therefore you are kind of out of luck; in general, checking that all the elements are incomparable IS going to take O(N^2) time, as far as I understand.