For a group project I'm currently part of, we have to simulate the following:
Take a square with side length n. Distribute some amount of unit disks uniformly over the square. Find the number of disks required until there is a connected component of disks stretching from the left side to the right side of the square. Then find the number of disks required until the square is entirely filled with disks.
It's not explicitly stated, but we assume this is to be done in Matlab, as we use it in other parts of this course. For the first problem, finding a path from the left to the right, I've written two methods that work. One uses an adjacency list and the graph tools in Matlab for finding connected nodes. This method is fast enough but takes up far too much memory for what we need to do. The other method uses a recursive search algorithm without storing adjacency information, but is far too slow.
The problem arises when we need the square to be of sizes n=1000 and n=10 000. We predict this will require some tens of millions of circles, or more, and we simply do not see how we're supposed to handle this: any adjacency list or matrix would be ridiculously large, and not using one seems to require a huge amount of time. Any thoughts and ideas are appreciated, thanks.
Let's reformulate your problems as decision problems, like this:
Starting with seed s for the PRNG, randomly distribute m unit disks in an n x n square, and determine whether or not they make a connected path from the left side to the right.
and:
Starting with seed s for the PRNG, randomly distribute m unit disks in an n x n square, and determine whether or not they cover the entire square.
If you can solve these decision problems, then you can find the minimum value for m by doing a binary search over the possible values.
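For example, here is a minimal binary-search sketch in Python, assuming a hypothetical decision function decide(seed, m, n) that answers one of the two questions above:

```python
def min_disks(seed, n, decide):
    """Find the smallest m for which decide(seed, m, n) is True.
    Assumes decide is monotone in m, i.e. adding more disks never
    breaks the property being tested."""
    hi = 1
    while not decide(seed, hi, n):   # exponential search for an upper bound
        hi *= 2
    lo = hi // 2 + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(seed, mid, n):
            hi = mid
        else:
            lo = mid + 1
    return lo
```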
You can solve these decision problems without using a whole lot of RAM by generating your disks like so:
Initialize your PRNG with the given seed s
For each disk 1 ... m, randomly select the column it will go in (that's floor(center.x)). Accumulate counts for all the columns to determine how many disks will go in each one. Let COUNT(col) be the number of disks to generate in column col
For each col in 0 ... n-1, generate the COUNT(col) disks uniformly distributed within that column.
Now you are generating your disks predominantly from left to right. Both of the above decision problems have solutions that work from left to right and only need to see the disks in one or two columns at a time, so you only need to remember one or two columns' worth of disks.
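A rough sketch of that two-pass generation, assuming NumPy; the per-column sub-streams are an illustrative choice that makes growing m append disks to columns rather than reshuffle them (function and variable names are just for this sketch):

```python
import numpy as np

def disks_by_column(seed, m, n):
    """Yield (col, centers) for col = 0 .. n-1, one column at a time.
    Pass 1 only counts how many of the m disks land in each column;
    pass 2 draws each column's centers from its own seeded sub-stream."""
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, n, size=m)             # column of each disk (floor of x)
    counts = np.bincount(cols, minlength=n)       # for huge m, do this pass in chunks
    for col in range(n):
        sub = np.random.default_rng([seed, col])  # per-column sub-stream (illustrative)
        k = int(counts[col])
        x = col + sub.random(k)                   # uniform inside the column
        y = n * sub.random(k)
        yield col, np.column_stack((x, y))
```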
For the full coverage problem, work from left to right and determine whether each column in the square is completely covered, using disks from that column and the adjacent columns. No other disks could possibly reach it.
For the left-right path problem, work from left to right using union-find to keep track of which disks in the current column are connected to the left side and to each other. When you get to the last column, you can check whether any disks in it are connected to the left side and extend past the right side.
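A minimal union-find sketch of that left-to-right sweep; it consumes (col, centers) pairs such as the disks_by_column sketch above produces. The O(k^2) neighbour scan and the ever-growing union-find table are simplifications a real implementation would tighten up:

```python
class DSU:
    """Minimal union-find (disjoint set union) with path compression."""
    def __init__(self):
        self.parent = {}

    def find(self, a):
        root = self.parent.setdefault(a, a)
        while root != self.parent[root]:
            root = self.parent[root]
        while self.parent[a] != root:            # compress the path we walked
            self.parent[a], a = root, self.parent[a]
        return root

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


LEFT = "LEFT"   # virtual node meaning "connected to the left side of the square"

def has_left_right_path(columns, n):
    """columns yields (col, centers) pairs left to right, e.g. from the
    disks_by_column sketch above; centers are (x, y) unit-disk centres."""
    dsu = DSU()
    recent = []                                  # disks from the two newest columns
    next_id = 0
    for col, centers in columns:
        current = []
        for x, y in centers:
            disk = (next_id, x, y)
            next_id += 1
            if x <= 1.0:                         # a unit disk reaches the left edge
                dsu.union(disk[0], LEFT)
            # Unit disks touch when their centres are within distance 2; a real
            # implementation would bucket by y instead of this O(k^2) scan.
            for other in recent + current:
                if (x - other[1]) ** 2 + (y - other[2]) ** 2 <= 4.0:
                    dsu.union(disk[0], other[0])
            current.append(disk)
        recent = [d for d in recent + current if d[1] >= col - 1]
    # Only disks in the rightmost column can reach the right edge (x >= n - 1),
    # and those are still held in `recent` after the final iteration.
    return any(d[1] >= n - 1 and dsu.find(d[0]) == dsu.find(LEFT) for d in recent)
```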
Related
The scenario: a large number of players, playing a real-time game in 3D space, must be organized in a way where a server can efficiently update other players and any other observer of a player's moves and actions. Which objects 'talk' to one another needs to be culled based on their range from one another in the simulation; this is to preserve network sanity, programmer sanity, and also to allow server-lets to handle smaller chunks of the overall world play-space.
However, if you have 3000 players, this runs into the issue that one must run 3000! calculations to find out the ranges between everything. (Google tells me that ends up as a number with over 9000 digits; that's insane and not worth considering for a near-real-time environment.)
Daybreak Games seems to have solved this problem with their massive online first-person shooter Planetside 2; it allowed 3000 players to play in a shared space with real-time responsiveness. They've apparently done it through a "sphere tree" data structure.
However, I'm not positive this is the solution they use, and I'm still unsure how to apply the concept of "sphere trees" to reduce the range calculations for culling to a reasonable amount.
If Sphere Trees are not the right tree to bark up, what else should I be directing my attention at to tackle this problem?
(I'm a C# programmer (mainly), but I'm looking for a logical answer, not a code one.)
References I've found about sphere trees:
http://isg.cs.tcd.ie/spheretree/#algorithms
https://books.google.com/books?id=1-NfBElV97IC&pg=PA385&lpg=PA385#v=onepage&q&f=false
Here are a few of my thoughts:
Let n denote the total number of players.
I think your estimate of 3000! is wrong. If you want to calculate all pairwise distances, you only need 3000 choose 2 = 4,498,500 distance computations, which is on the order of O(n^2 * t), where t is the number of operations you spend calculating the distance between two players. If you build the graph underlying the players with edge weights being the Euclidean distances, you can also reduce this to the all-pairs shortest paths problem, which is solvable via the Floyd-Warshall algorithm in O(n^3).
What you're describing sounds pretty similar to doing a range query: https://en.wikipedia.org/wiki/Range_searching. There are a lot of data structures that can help you, such as range trees and k-d trees.
If objects only need to interact with objects that are, e.g., <= 100m away, then you can divide the world up into 100m x 100m tiles (or voxels), and keep track of which objects are in each non-empty tile.
Then, when one object needs to 'talk', you only need to check the objects in at most 9 tiles to see if they are close enough to hear it.
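A minimal sketch of that tiling scheme (the 100 m radius, the 2D keys, and the class layout are illustrative assumptions, not anyone's actual engine code):

```python
from collections import defaultdict

TILE = 100.0  # interaction radius, also used as the tile edge length (illustrative)

class TileIndex:
    """Maps each non-empty tile (ix, iy) to the set of object ids inside it."""
    def __init__(self):
        self.tiles = defaultdict(set)
        self.position = {}

    def _key(self, x, y):
        return (int(x // TILE), int(y // TILE))

    def move(self, obj_id, x, y):
        old = self.position.get(obj_id)
        if old is not None:
            self.tiles[self._key(*old)].discard(obj_id)
        self.position[obj_id] = (x, y)
        self.tiles[self._key(x, y)].add(obj_id)

    def neighbours(self, obj_id):
        """Objects within TILE of obj_id: only the 9 surrounding tiles are checked."""
        x, y = self.position[obj_id]
        ix, iy = self._key(x, y)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for other in self.tiles.get((ix + dx, iy + dy), ()):
                    if other == obj_id:
                        continue
                    ox, oy = self.position[other]
                    if (x - ox) ** 2 + (y - oy) ** 2 <= TILE ** 2:
                        out.append(other)
        return out
```

Moving an object only removes it from its old tile and adds it to the new one, so the cost of an update does not depend on the total player count.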
I posted this in the Computer Science section but no one replied :(. Any help would be greatly appreciated :).
There is a grid of size MxN, with M ~ 20000 and N ~ 10, so M is very large. One way to look at this is as N grid blocks of size M placed side by side. Next, assume there are K users, each with an MxN utility matrix, where each element gives the utility that user will obtain if assigned that grid element. The allocation needs to be done such that, for each assigned user, the total utility exceeds a certain threshold U in every grid block. Assume only one user can be assigned to any one grid element. What is the maximum number of users that can be assigned? (It's okay if some users are not assigned.)
Level 2: Now assume that, for each user, the utility threshold U only needs to be exceeded in at least n out of the N blocks. For this problem, what's the maximum number of users that can be assigned?
Of course brute-force search is of no use here due to the K^(MN) complexity. I am guessing that some kind of dynamic programming approach may be possible.
To my understanding, the problem can be modelled as a maximum-weight bipartite matching problem, which can be solved efficiently with the Hungarian algorithm. In the left partition L, create K nodes, one for each user. In the right partition R, create M*N nodes, one for each cell in the grid. Then create an edge for each l in L and r in R with weight equal to the utility of assigning user l to grid cell r.
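A minimal sketch of that formulation, assuming SciPy is available; scipy.optimize.linear_sum_assignment solves the same assignment problem the Hungarian algorithm does. Note this assigns exactly one cell per user and does not enforce the per-block threshold U:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_users(utility):
    """utility: K x (M*N) array, utility[k, c] = value of giving cell c to user k.
    Returns (user, cell) pairs maximizing total utility, one cell per user."""
    rows, cols = linear_sum_assignment(utility, maximize=True)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

# Hypothetical 3-user, 4-cell utility matrix, just to show the call:
u = np.array([[5, 1, 0, 2],
              [3, 4, 2, 1],
              [0, 2, 6, 3]])
print(assign_users(u))   # [(0, 0), (1, 1), (2, 2)]
```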
Using a different interpretation of your question than Codor, I am going to claim that (at least in theory, in the worst case) it is a hard problem.
Suppose that we can solve it in the special case when there is one block which must be shared between two users, who each have the same utility for each cell, and the threshold utility U is (half of the total utility for all the cells in the block), minus one.
This means that in order to solve the problem we must take a list of numbers and divide them up into two sets such that the sum of the numbers in each set is the same, and is exactly half of the total sum of the numbers available.
This is the partition problem (http://en.wikipedia.org/wiki/Partition_problem), which is NP-complete, so if you could solve your problem as I have described it, you could solve a problem known to be hard.
(However, the Wikipedia entry does say that this is known as "the easiest hard problem", so even if I have described it correctly, there may be solutions that work well in practice.)
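For intuition on why it is called the "easiest" hard problem: partition has a pseudo-polynomial dynamic program, so instances with moderate utility sums are easy in practice. A small sketch, not part of the reduction above:

```python
def can_partition(values):
    """Return True if `values` (non-negative ints) can be split into two
    subsets with equal sums. Runs in O(len(values) * sum(values)) time."""
    total = sum(values)
    if total % 2:
        return False
    target = total // 2
    reachable = {0}                        # subset sums seen so far
    for v in values:
        reachable |= {s + v for s in reachable if s + v <= target}
        if target in reachable:
            return True
    return target in reachable

print(can_partition([3, 1, 1, 2, 2, 1]))   # True: {3, 2} vs {1, 1, 2, 1}
```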
For example, let's say we have a bounded 2D grid which we want to cover with square tiles of equal size. We have an unlimited number of tiles that fall into a defined number of types. Each type of tile specifies the letters printed on that tile. Letters are printed next to each edge, and only tiles with matching letters on their adjacent edges can be placed next to one another on the grid. Tiles may be rotated.
Given the size of the grid and the tile type definitions, what is the fastest method of arranging the tiles such that the above constraint is met and the entire grid (or the majority of it) is covered? Note that my use case is large grids (~20 in each dimension) and a medium-to-large number of solutions (unlike Eternity II).
So far, I've tried DFS starting in the center and picking the locations around filled area that allow the least number of possibilities and backtrack in case no progress can be made. This only works for simple scenarios with one or two types. Any more and too much backtracking ensues.
Here's a trivial example, showing input and the final output:
This is a hard puzzle.
Eternity II was a puzzle of this form on a 16 by 16 square grid.
Despite a 2 million dollar prize, no one found the solution in several years.
The paper "Jigsaw Puzzles, Edge Matching, and Polyomino Packing: Connections and Complexity" by Erik D. Demaine, Martin L. Demaine shows that this type of puzzle is NP-complete.
Given a problem of this sort with a square grid, I would try a brute-force list of all possible columns, and then a dynamic programming solution across the columns. This will find the number of solutions, and with a little extra work can be used to generate all of the solutions.
However, if your columns are n tiles long and there are m letters with k tile types, then the brute-force list of all possible columns has up to m^n possible right edges, with up to (4k)^n combinations/rotations of tiles needed to generate them. The transition from the right edge of one column to the right edge of the next then potentially has up to m^(2n) possibilities in it. Those numbers are usually far below the worst case, but the size of those data structures is the upper bound on the feasibility of this technique.
Of course, if those data structures get too large to be feasible, then the recursive algorithm that you are describing will also be too slow to enumerate all solutions. But if there are enough solutions, it still might find one acceptably fast even when this approach is infeasible.
Of course if you have more rows than columns, you'll want to reverse the role of rows and columns in this technique.
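Here is a minimal sketch of that column-by-column dynamic program for small grids, counting tilings; tiles are given as (top, right, bottom, left) letter tuples, and the names are illustrative:

```python
from collections import Counter
from itertools import product

def count_tilings(rows, cols, tile_types):
    """Count ways to tile a rows x cols grid, sweeping column by column.
    tile_types: (top, right, bottom, left) letter tuples; unlimited supply of
    each, rotations allowed, adjacent edges must carry equal letters."""
    # Every rotation of every tile type (a rotation cycles the four edge letters).
    oriented = set()
    for t in tile_types:
        edges = tuple(t)
        for s in range(4):
            oriented.add(edges[s:] + edges[:s])

    # Enumerate all vertically consistent columns of `rows` oriented tiles and
    # record how many columns realise each (left-edge, right-edge) letter pair.
    transitions = Counter()
    for col in product(oriented, repeat=rows):
        if all(col[i][2] == col[i + 1][0] for i in range(rows - 1)):  # bottom == top
            left = tuple(t[3] for t in col)
            right = tuple(t[1] for t in col)
            transitions[(left, right)] += 1

    # Dynamic program across columns, keyed by the right edge of the last column.
    dp = Counter()
    for (left, right), n in transitions.items():
        dp[right] += n                      # leftmost column: no left-hand constraint
    for _ in range(cols - 1):
        ndp = Counter()
        for (left, right), n in transitions.items():
            if left in dp:
                ndp[right] += dp[left] * n
        dp = ndp
    return sum(dp.values())

print(count_tilings(2, 3, [("A", "A", "A", "A")]))   # 1: the all-'A' tile always fits
```

The dictionary keyed by right-edge tuples is exactly the structure whose size is bounded by the m^n figure above.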
OK guys, I have a real-world problem, and I need some algorithm to figure it out.
We have a bunch of orders waiting to be shipped, each order will have a volume (in cubic feet), let's say, V1, V2, V3, ..., Vn
The shipping carrier can provide us four types of containers, and the volume/price of the containers are listed below:
Container Type 1: 2700 CuFt / $2500;
Container Type 2: 2350 CuFt / $2200;
Container Type 3: 2050 CuFt / $2170;
Container Type 4: 1000 CuFt / $1700;
No single order will exceed 2700 CuFt, but orders are likely to exceed 1000 CuFt.
Now we need a program to find the solution that minimizes the total freight charge.
I appreciate any suggestions/ideas.
EDIT:
My current implementation uses the biggest container first, applies the first-fit-decreasing bin-packing algorithm to get a result, then goes through all the containers and adjusts each container's size according to the volume of its contents...
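Roughly, the first-fit-decreasing step I described looks like this (a simplified sketch; the container-downsizing pass at the end is omitted):

```python
CONTAINERS = [(2700, 2500), (2350, 2200), (2050, 2170), (1000, 1700)]  # (CuFt, price)
BIGGEST = CONTAINERS[0][0]

def first_fit_decreasing(volumes):
    """Pack order volumes into 2700 CuFt bins: place each order (largest first)
    into the first bin with enough room left, opening a new bin when none fits."""
    bins = []                            # remaining capacity of each open bin
    for v in sorted(volumes, reverse=True):
        for i, remaining in enumerate(bins):
            if v <= remaining:
                bins[i] -= v
                break
        else:
            bins.append(BIGGEST - v)     # open a new biggest container
    return bins

# Hypothetical order volumes, just to show the call:
used = first_fit_decreasing([1200, 900, 800, 2600, 400, 300])
print(len(used), "containers, remaining capacities:", used)
```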
I wrote a similar program when I was working for a logistics company. This is a 3-dimensional bin-packing problem, which is a bit trickier than a classic 1-dimensional bin-packing problem - the person at my job who wrote the old box-packing program that I was replacing made the mistake of reducing everything to a 1-dimensional bin-packing problem (volumes of boxes and volumes of packages), but this doesn't work: this problem formulation states that three 8x8x8 packages would fit into a 12x12x12 box, but this would leave you with overlapping packages.
My solution was to use what's called a guillotine cut heuristic: when you put a package into the shipping container, this produces three new empty sub-containers. Assuming that you placed the package in the back bottom left of the container, you would have a new empty sub-container in the space in front of the package, a new empty sub-container in the space to the right of the package, and a new empty sub-container on top of the package. Be certain not to assign the same empty space to multiple sub-containers: e.g. if you're not careful, you'll assign the section in the front-right of the container to both the front sub-container and the right sub-container; you need to pick just one to assign it to. This heuristic will rule out some optimal solutions, but it's fast.

(As a concrete example, say you have a 12x12x12 box and you put an 8x8x8 package into it - this would leave you with a 4x12x12 empty sub-container, a 4x8x12 empty sub-container, and a 4x8x8 empty sub-container. Note that the wrong way to divide up the free space is to have three 4x12x12 empty sub-containers - this is going to result in overlapping packages. If the box or package weren't cubes then you'd have more than one way to divide up the free space - you'd need to decide whether to maximize the size of one or two sub-containers or to instead try to create three more or less equal sub-containers.)

You need to use a reasonable criterion for ordering/selecting the sub-containers or else the number of sub-containers will grow exponentially; I solved this problem by filling the smallest sub-containers first and removing any sub-container that was too small to contain a package, which kept the quantity of sub-containers to a reasonable number.
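A minimal sketch of that split, assuming the package sits at the back-bottom-left corner and the overlapping space is given to the front piece (the dimension ordering is an illustrative choice):

```python
def guillotine_split(container, package):
    """container and package are (width, depth, height), package at the
    back-bottom-left corner. Returns the three empty sub-containers,
    assigning the shared front-right space to the front piece only."""
    cw, cd, ch = container
    pw, pd, ph = package
    subs = [
        (cw,      cd - pd, ch),       # in front of the package (full width)
        (cw - pw, pd,      ch),       # to the right of the package
        (pw,      pd,      ch - ph),  # on top of the package
    ]
    # drop degenerate pieces where the package already spans that dimension
    return [s for s in subs if all(d > 0 for d in s)]

print(guillotine_split((12, 12, 12), (8, 8, 8)))
# [(12, 4, 12), (4, 8, 12), (8, 8, 4)] - the same three pieces as the example above
```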
There are several options you have: what containers to use, how to rotate the packages going into the container (there are usually six ways to rotate a package, but not all rotations are legal for some packages e.g. a "this end up" package will only have two rotations), how to partition the sub-containers (e.g. do you assign the overlapping space to the right sub-container or to the front sub-container), and in what order you pack the container. I used a randomized algorithm that approximated a best-fit decreasing heuristic (using volume for the heuristic) and that favored creating one large sub-container and two small sub-containers rather than three medium-sized sub-containers, but I used a random number generator to mix things up (so the greatest probability is that I'd select the largest package first, but there would be a lesser probability that I'd select the second-largest package first, and so on, with the lowest probability being that I'd select the smallest package first; likewise, there was a chance that I'd favor creating three medium-sized sub-containers instead of one large and two small, there was a chance that I'd use three medium-sized boxes instead of two large boxes, etc). I then ran this in parallel several dozen times and selected the result that cost the least.
There are other heuristics I considered, for example the extreme point heuristic is slower (while still running in polynomial time - IIRC it's a cubic time solution, whereas the guillotine cut heuristic is linear time, and at the other extreme the branch and bound algorithm finds the optimal solution and runs in exponential time) but more accurate (specifically, it finds some optimal solutions that are ruled out by the guillotine cut heuristic); however, my use case was that I was supposed to produce a fast shipping estimate, and so the extreme point heuristic wasn't appropriate (it was too slow and it was "too accurate" - I would have needed to add 10% or 20% to its results to account for the fact that the people actually packing the boxes would inevitably make sub-optimal choices).
I don't know the name of a program offhand, but there's probably some commercial software that would solve this for you, depending on how much a good solution is worth to you.
Zim Zam's answer is good for big boxes, but assuming relatively small boxes, you can use a much simpler algorithm that amounts to solving an integer linear program with one constraint:
Let a, b, c and d be non-negative integers giving the number of containers of each type used.

Given the constraint

2700a + 2350b + 2050c + 1000d >= V (where V is the total volume of the orders),

you want to find a, b, c and d such that the following function is minimized:

Total Cost C = 2500a + 2200b + 2170c + 1700d
It would seem you can brute force this problem (integer programming is NP-hard in general, but this instance is tiny). Calculate every possible viable combination of a, b, c and d, and compute the total cost for each combination. Note that no solution will ever use more than one container of type 4 - two of them give only 2000 CuFt for $3400, while a single type-1 container gives 2700 CuFt for $2500 - so that cuts down the number of possible combinations.
I am assuming orders can be split between containers.
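A minimal brute-force sketch under that assumption (orders split freely, so only the total volume V matters):

```python
from math import ceil
from itertools import product

SIZES  = [2700, 2350, 2050, 1000]
PRICES = [2500, 2200, 2170, 1700]

def cheapest_containers(total_volume):
    """Try every combination (a, b, c, d) of container counts that could be
    needed and keep the cheapest one whose total capacity covers the volume."""
    limits = [ceil(total_volume / s) + 1 for s in SIZES]
    best = None
    for counts in product(*(range(l) for l in limits)):
        capacity = sum(c * s for c, s in zip(counts, SIZES))
        if capacity < total_volume:
            continue
        cost = sum(c * p for c, p in zip(counts, PRICES))
        if best is None or cost < best[0]:
            best = (cost, counts)
    return best

print(cheapest_containers(5200))   # (5000, (2, 0, 0, 0)): two type-1 containers
```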
I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allow for the quick, and concurrent, querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. its y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found is Z-order curves, but I cannot seem to figure out how to conduct the queries given my two-dimensional dataset.
A sequential scan of the data is not an acceptable solution; it would be far too slow.
I think the most appropriate data structures for your requirements are the R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). The R-tree is similar to a B+-tree, but also supports multidimensional queries.
Another relevant data structure is the priority search tree. It is good for queries like your examples 1..3, but not very efficient if you need frequent updates or an on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
    Create a result set of rectangles
    For each rectangle in your set
        If the rectangle is a single point, you are done, it is what you are looking for.
        Otherwise, divide the rectangle in two (specify one additional bit of the z-curve)
            If both halves contain a point
                If one half is better
                    Add that rectangle to your result set of rectangles
                Otherwise
                    Add both rectangles to your result set of rectangles
            Otherwise, only one half contains a point
                Add that rectangle to your result set of rectangles
    Search your result set of rectangles
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
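For reference, a minimal sketch of how the z-order key itself is built by interleaving bits (assumes non-negative coordinates; signed 64-bit values would need a bias added first):

```python
def interleave(x, y, bits=64):
    """Build a z-order (Morton) key by interleaving the bits of x and y.
    Sorting tuples by this key walks the plane along the z-curve."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key

# Keys of nearby points share their high-order bits, which is what lets the
# recursive rectangle-splitting search above prune large regions.
print(interleave(3, 5))    # 0b100111 -> 39
```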
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally; once a leaf is reached, it contains a pointer to the root of the next B+ tree, which belongs to the next dimension. Everything in that second B+ tree belongs to the same x value.
The original plan was to only store the unique values for each dimension along with its count. This is a very simple form of compression (if you can even call it that) while still allowing the entire data set to be represented. This 'linked' dimension scheme also allows extra dimensions to be added later, as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log_b(card(x)) + log_b(card(y))
where b is the base of each B+ tree and card(x) would be the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.
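A minimal in-memory sketch of the linked idea, using sorted lists as a stand-in for the B+ trees and ignoring the on-disk and counting details (names are illustrative):

```python
from bisect import bisect_left, insort

class StackedIndex:
    """First 'dimension' maps each distinct x to a sorted list of its y values,
    with the x keys themselves kept sorted - a toy stand-in for linked B+ trees."""
    def __init__(self):
        self.xs = []      # sorted distinct x values
        self.ys = {}      # x -> sorted list of y values

    def insert(self, x, y):
        i = bisect_left(self.xs, x)
        if i == len(self.xs) or self.xs[i] != x:
            self.xs.insert(i, x)
            self.ys[x] = []
        insort(self.ys[x], y)

    def min_x_with_y_at_least(self, y_min):
        """Smallest x that has some y >= y_min (query 2 in the question)."""
        for x in self.xs:                 # a real tree would prune this scan
            ys = self.ys[x]
            if ys and ys[-1] >= y_min:
                return x, ys[bisect_left(ys, y_min)]
        return None

idx = StackedIndex()
for x, y in [(10, 3), (12, 9), (15, 1)]:
    idx.insert(x, y)
print(idx.min_x_with_y_at_least(5))   # (12, 9)
```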
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. Tokyo Cabinet is a free software licensed under the GNU Lesser General Public License.
It may be easy for you to embed.