Finding non-monotonic regions in decision trees - algorithm

I have a binary decision tree T that takes a vector V of n real numbers and outputs a number S by following per-coordinate binary splits on V. I'd like to find regions of the tree that are non-monotonic. That is, if I decrease one or more inputs in V to form V', and the tree then assigns a larger output to V' than to V, then I've found a non-monotonic region.
How can I find these regions?

I'm assuming that "per coordinate binary splits" means that decisions are made on a single coordinate at a time. For every pair of leaves L1 and L2 where L1 has a lower value than L2, determine the axis-aligned bounding boxes of L1 and L2. If L1's maximum corner dominates L2's minimum corner for some such pair, then the tree is non-monotone: there is a point in L1's box that is componentwise >= a point in L2's box, so decreasing inputs from the first point to the second increases the output. Conversely, if no such pair exists, the tree is monotone.
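To make the pairwise check concrete, here is a minimal Python sketch. The node layout is an assumption of mine, not something from the question: a leaf is {'value': v}, an internal node is {'feature': i, 'threshold': t, 'left': ..., 'right': ...}, and the left branch is taken when V[feature] <= threshold.

    import itertools
    import math

    def collect_leaves(node, box, leaves):
        """Gather (box, value) pairs, where box is a list of (low, high)
        intervals, one per feature."""
        if 'value' in node:
            leaves.append((list(box), node['value']))
            return
        i, t = node['feature'], node['threshold']
        lo, hi = box[i]
        box[i] = (lo, min(hi, t))            # left branch: V[i] <= t
        collect_leaves(node['left'], box, leaves)
        box[i] = (max(lo, t), hi)            # right branch: V[i] > t
        collect_leaves(node['right'], box, leaves)
        box[i] = (lo, hi)                    # restore for the caller

    def non_monotone_pairs(tree, n_features):
        """Return pairs (box_low, box_high) of leaf boxes witnessing
        non-monotonicity: the lower-valued leaf's max corner dominates the
        higher-valued leaf's min corner."""
        leaves = []
        collect_leaves(tree, [(-math.inf, math.inf)] * n_features, leaves)
        witnesses = []
        for (box1, v1), (box2, v2) in itertools.permutations(leaves, 2):
            if v1 < v2 and all(hi1 >= lo2
                               for (_, hi1), (lo2, _) in zip(box1, box2)):
                witnesses.append((box1, box2))
        return witnesses

This enumerates all leaf pairs, so it is quadratic in the number of leaves, but each box comparison is only linear in n.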

I am not providing particulars, just the general direction. Let me know if you need more details.
Assume you have a tree that considers a single feature (i.e. one real number) and outputs either a single number or a range. It is straightforward to find non-monotonic regions of that tree: at any node, if the range of outputs in the left subtree overlaps the range of outputs in the right subtree (in particular, if the left subtree's maximum exceeds the right subtree's minimum), then there are non-monotonic regions in that part of the tree.
You can convert your general DT into a DT that works on only one feature and apply the above methodology.
In general, you can maintain the range for each feature at every node and use the same criterion I mention above to find such regions.
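A sketch of the per-node range check described above, using the same hypothetical node layout as in the previous answer's sketch (the left branch holds the lower input values):

    def output_range(node):
        """Return (min, max) over the outputs reachable from this node."""
        if 'value' in node:
            return node['value'], node['value']
        llo, lhi = output_range(node['left'])
        rlo, rhi = output_range(node['right'])
        return min(llo, rlo), max(lhi, rhi)

    def non_monotone_nodes(node, path=()):
        """Yield paths to nodes where the left subtree (lower inputs) can
        produce a larger output than the right subtree (higher inputs)."""
        if 'value' in node:
            return
        _, left_hi = output_range(node['left'])
        right_lo, _ = output_range(node['right'])
        if left_hi > right_lo:               # overlap in the problematic direction
            yield path
        yield from non_monotone_nodes(node['left'], path + ('L',))
        yield from non_monotone_nodes(node['right'], path + ('R',))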

Related

Max flow: how to force f units to flow, with minimal changes to capacity?

Let's say I have a graph and run a max-flow on it. I get some flow, f. However, I want f1 units to flow, where f1 > f. Of course, I need to increase some of the edge capacities. I want the total increase to the capacities to be as small as possible. Is there a clever algorithm to achieve this?
If it helps, for my application I care about bipartite graphs where the source (s) is connected to the left vertices (L) with finite integer capacities (c_l), the left vertices are connected to the right vertices (R) with infinite capacities, and all right vertices are connected to a sink vertex with finite integer capacities (c_r). The c_l values and the c_r values sum to the same number. Also, there are no edges among the left vertices or among the right ones.
An example is provided in the image below. The blue numbers are the flow capacities and the pink numbers are the actual flows in the max-flow. Currently, 5 units are flowing but I want 9 units to flow.
In general, turn the flow instance into a min-cost flow instance by setting the cost of the existing arcs to zero and adding a new infinite-capacity arc of cost one in parallel with each of them.
For these particular instances, the best you're going to do is to repeatedly find an unsaturated arc of finite capacity and push flow along any path that includes it. Once everything's saturated just use any path.
This seems a little too easy to be what you want, so I'll mention that it's possible to formulate more sophisticated objectives and solve them using linear programming techniques.
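As a concrete sketch of that reduction (my own code, assuming the networkx package is available): since networkx's min_cost_flow does not allow parallel edges in a DiGraph, each cost-one "doubling" arc is routed through a small helper node.

    import networkx as nx

    def min_capacity_increase(edges, s, t, f1):
        """edges: list of (u, v, capacity). Find a flow of value f1 from s to t
        that minimizes the total capacity added, by solving a min-cost flow:
        existing arcs cost 0, "overflow" arcs cost 1 per unit."""
        G = nx.DiGraph()
        for k, (u, v, cap) in enumerate(edges):
            G.add_edge(u, v, capacity=cap, weight=0)     # existing capacity, free
            helper = ('over', k)                         # helper node stands in for a
            G.add_edge(u, helper, weight=1)              # parallel arc; omitting
            G.add_edge(helper, v, weight=0)              # 'capacity' means unbounded
        G.nodes[s]['demand'] = -f1                       # force f1 units out of s
        G.nodes[t]['demand'] = f1                        # ... and into t
        flow = nx.min_cost_flow(G)
        # Capacity that has to be added to each original arc.
        return {(u, v): flow[u][('over', k)]
                for k, (u, v, cap) in enumerate(edges)
                if flow[u].get(('over', k), 0) > 0}

The total cost of this min-cost flow equals the total capacity increase, which is exactly the quantity being minimized.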
The graph is effectively undirected, and all the "middle" edges (from L to R) have infinite capacity. That means we can unify all vertices in L and R connected by infinite-capacity edges, making a very simple graph indeed.
For example, in the above graph, an equivalent graph would be:
s -8-> Vertex 1+2+4 -4-> t
s -1-> Vertex 3+5 -5-> t
So we end up with just a bunch of unique paths with no branching. We can unify the nodes with a simple "floodfill" or DFS type search on infinite-capacity edges. When we unify nodes, we add up their "left" and "right" capacities.
To maximize flow in this graph we:
First, if a node's left and right capacities are not equal, increase the lower one until they are equal. This converts an increase of cost X into an increase in flow of X.
Once the left and right capacities are equal for all nodes, we pick any path. Then we increase both halves of the path, turning a cost of 2X into an increase in flow of X.
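A small Python sketch of this special-case procedure; the input format and names are my own. It merges L and R vertices joined by the infinite middle edges, pays 1 per unit to balance a merged node's left and right capacities, and 2 per unit beyond that.

    def min_increase_bipartite(c_l, c_r, middle_edges, f1):
        """c_l: dict left vertex -> capacity from s; c_r: dict right vertex ->
        capacity to t; middle_edges: list of (left, right) infinite-capacity
        edges. Returns the minimum total capacity increase to push f1 units.
        Assumes every merged component has at least one s-edge and one t-edge."""
        # Union-find over all vertices, joining across infinite middle edges.
        parent = {v: v for v in list(c_l) + list(c_r)}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for u, v in middle_edges:
            parent[find(u)] = find(v)

        # Sum left (s-side) and right (t-side) capacities per merged component.
        left, right = {}, {}
        for v, c in c_l.items():
            left[find(v)] = left.get(find(v), 0) + c
        for v, c in c_r.items():
            right[find(v)] = right.get(find(v), 0) + c

        comps = set(left) | set(right)
        flow = sum(min(left.get(c, 0), right.get(c, 0)) for c in comps)
        cost = 0
        # Phase 1: balancing a component buys extra flow at cost 1 per unit.
        for c in comps:
            if flow >= f1:
                return cost
            gain = min(abs(left.get(c, 0) - right.get(c, 0)), f1 - flow)
            flow += gain
            cost += gain
        # Phase 2: beyond that, each extra unit needs +1 on both sides.
        return cost + 2 * max(0, f1 - flow)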

How can I count the number of a grid graph's trees in a short time?

Assume a situation where (L + 1)^2 nodes are placed in an L x L grid. Nodes are randomly linked with their nearest neighbors (so the degree of a node is 0, 1, 2, 3, or 4). Of course, the graph isn't generally connected, and say you know the number of connected components of the initial state.
(Fig. 1: example of the initial state)
Now you choose two neighboring nodes, and
If they are linked, cut the edge.
If they aren't linked, link them with a new edge.
(Fig. 2: example of the operation)
You repeat this cut-or-link operation many times.
Then, every time you execute the operation, how can you know the number of connected components in a short time, like O(log L^2) or O(L)? It is easy to compute the number in O(L^2) if you maintain an array of length (L + 1)^2 whose elements store the index of the component each node belongs to. You can also use the Union-Find algorithm, but because it can't unlink nodes, you frequently have to reconstruct the forest in O(L^2 * log L^2).
Does anyone have an idea? The Link/Cut Tree data structure seemed suitable, but it probably can't be used for this problem (for example, a Link/Cut Tree's cut(v) function takes only one argument, where v is the index of a node).
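For reference, a minimal sketch of the baseline already described in the question: keep the current edge set and rebuild a Union-Find after every cut-or-link, which costs on the order of L^2 per operation rather than the hoped-for O(L) or O(log L^2). Names are my own.

    class UnionFind:
        def __init__(self, n):
            self.parent = list(range(n))
            self.count = n                      # number of components

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra != rb:
                self.parent[ra] = rb
                self.count -= 1

    def toggle_and_count(n_nodes, edges, u, v):
        """edges: set of frozenset({u, v}) pairs. Toggle edge (u, v), then
        rebuild a Union-Find from scratch and return the component count."""
        e = frozenset((u, v))
        if e in edges:
            edges.remove(e)                     # cut
        else:
            edges.add(e)                        # link
        uf = UnionFind(n_nodes)
        for a, b in (tuple(x) for x in edges):
            uf.union(a, b)
        return uf.count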

Range Trees: why not save space by default?

Suppose you have a set S of unique points on the 2-dimensional plane. Now, you are expecting a bunch of questions of the form "is point p present in S?" You decide to build a Range Tree to store S and answer this question. The basic idea behind a Range Tree is that you first build a balanced binary search tree Tree0 on the 0-th coordinate, and then at each node of Tree0 you build yet another balanced search tree Tree1, this time using the 1-st coordinate as your key. Wikipedia article for Range Tree.
Now, I was expecting that the Tree1 built for each node n0 of Tree0 would hold exactly those points whose 0-th coordinate equals the key at n0. However, if you read more about Range Trees, you will see that this is not the case. Specifically:
The root r0 of Tree0 contains a Tree1 which holds all points.
The left child of r0 contains a Tree1 which holds all of the points whose 0-th coordinate is less than the 0-th coordinate at r0.
The right child of r0 contains a Tree1 which holds all of the points whose 0-th coordinate is greater than that from r0.
If you continue this logic, you will see that at each level of the Range Tree every point is stored exactly once. So each level requires O(n) memory, and since the depth of a balanced Tree0 is log n, this gives an O(n log n) memory requirement.
However, if you stored at each node only the points whose 0-th coordinate exactly matches its key, you would store each point once in the entire tree (instead of once per level), which gives an O(n) memory requirement.
What is the reason behind storing the points once per level in the Range Tree? Does it allow for some kind of cool range queries or something? So far it looks to me like any query that you could perform on the O(nlogn) version is also available for the O(n) version. What am I missing?
(Expanding #user3386109’s comment into a full answer.)
There are several different data structures for storing 2D collections of points, each of which is optimized for different types of queries. As the name suggests, range trees are optimized for range searches, queries of the form “here’s a rectangle, what are all the points in that rectangle?” The structure of the range tree - storing each point in several different subtrees - is designed so that you can first find a small set of subtrees whose first-coordinate ranges together cover one side of the rectangle, and then search each of those subtrees' second-coordinate structures for the other side of the rectangle. If you aren’t planning on making any queries of that form, then there’s no need to store things this way. You’re essentially paying for something you aren’t going to use.
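To make that concrete, here is a minimal Python sketch in the spirit of a 2-D range tree (my own simplification: each node's sorted list of y-values stands in for its Tree1). The "fully covered" branch is exactly where the per-level copies pay off: a whole subtree's worth of points is handled with two binary searches.

    import bisect

    class Node:
        """Balanced BST on x; each node also stores the sorted y-values of
        every point in its subtree (the role played by Tree1)."""
        def __init__(self, pts):                  # pts sorted by x, non-empty
            mid = len(pts) // 2
            self.x, self.y = pts[mid]
            self.min_x, self.max_x = pts[0][0], pts[-1][0]
            self.ys = sorted(p[1] for p in pts)   # the "once per level" copies
            self.left = Node(pts[:mid]) if mid > 0 else None
            self.right = Node(pts[mid + 1:]) if mid + 1 < len(pts) else None

    def count_in_rect(node, x_lo, x_hi, y_lo, y_hi):
        """Count points inside [x_lo, x_hi] x [y_lo, y_hi]."""
        if node is None or node.max_x < x_lo or node.min_x > x_hi:
            return 0
        if x_lo <= node.min_x and node.max_x <= x_hi:
            # Whole subtree lies in the x-range: binary search its y list.
            return (bisect.bisect_right(node.ys, y_hi)
                    - bisect.bisect_left(node.ys, y_lo))
        here = int(x_lo <= node.x <= x_hi and y_lo <= node.y <= y_hi)
        return (here
                + count_in_rect(node.left, x_lo, x_hi, y_lo, y_hi)
                + count_in_rect(node.right, x_lo, x_hi, y_lo, y_hi))

    # tree = Node(sorted(points))   # points: list of (x, y) pairs

If each point were stored only at the node whose key matches its x-coordinate, that branch would have nothing to search, and a rectangle query would degrade to visiting every matching x individually.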
There are other data structures you could use for storing a set of points and seeing whether a particular point is present. If that’s the only question you need to answer, a simple hash table might be all you’d need to use. You could also use a regular BST where points are compared first by their first components, then by their second components. (You could also use a k-d tree here if you’d like.)
Hope this helps!

Efficiently checking which of a large collection of nodes are close together?

I'm currently interested in generating random geometric graphs. For my particular problem, we randomly place node v in the unit square, and add an edge from v to node u if they have Euclidean distance <= D, where D=D(u,n) varies with u and the number of nodes n in the graph.
Important points:
It is costly to compute D, so I'd like to minimize the number of calls to this function.
The vast majority of the time, when v is added, edges uv will be added to only a small number of nodes u (usually 0 or 1).
Question: What is an efficient method for checking which vertices u are "close enough" to v?
The brute-force algorithm is to compute and compare dist(v, u) and D(u, n) for every existing node u. Over the whole construction this requires O(n^2) calls to D.
I feel we should be able to do much better than this. Perhaps some kind of binning would work. We could divide the space up into bins, then for each vertex u store a list of bins in which a newly placed vertex v could result in the edge uv. If v ends up placed outside of u's list of bins (which should happen most of the time), then it's too far away and we don't need to compute D. This is somewhat of an off-the-top-of-my-head suggestion, and I don't know if it would work well (e.g., the overhead of computing sufficiently close bins might be too costly), so I'm after feedback.
Based on your description of the problem, I would choose an R-tree as your data structure.
It allows for very fast searching by drastically narrowing the set of vertices you need to run D against. In the worst case an insertion can require O(n) time, but you're quite unlikely to hit that worst case with a typical data set.
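For example, with the third-party Python rtree package (my choice of library; the answer doesn't name one), you can index each placed vertex as a degenerate box and query a window of half-width d_max, an assumed global upper bound on D(u, n), before evaluating D at all:

    from rtree import index          # wraps libspatialindex

    idx = index.Index()
    placed = []                      # node id -> (x, y)

    def add_vertex(x, y, d_max):
        """Insert (x, y) and return ids of candidate neighbors whose
        coordinates are within d_max of it; D(u, n) only needs to be
        evaluated for this (usually tiny) candidate list."""
        window = (x - d_max, y - d_max, x + d_max, y + d_max)
        candidates = list(idx.intersection(window))
        vid = len(placed)
        placed.append((x, y))
        idx.insert(vid, (x, y, x, y))   # a point is a zero-area box
        return candidates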
I would probably just use a binning approach.
Say we cut the unit square into m x m subsquares (each having side length 1/m, of course). Since you place your vertices uniformly at random (or so I assume), every square will contain n / m^2 vertices on average.
Depending on A1, A2, m and n, you can probably determine the maximum radius you need to check. Say that's less than 1/m (one square's side length). Then, after inserting v, you would only need to check the square in which it landed, plus the adjacent squares. Either way, this is a constant number of squares, so for every insertion you'll need to check O(n / m^2) other vertices on average.
I don't know the best value for m (as said, that depends on A1 and A2), but if it were around sqrt(n), then your entire algorithm could run in O(n) expected time.
EDIT
A small addition: you could keep track of vertices with many neighbors (i.e. with a radius so large that it extends over multiple squares) and check them against every inserted vertex. There should only be a few of those, so that's not a problem.
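A minimal sketch of the binning scheme with assumed names: M is the grid resolution and D_MAX is a global upper bound on D(u, n), taken here to be at most one cell width so the 3x3 block of cells around the new vertex suffices.

    import math
    import random
    from collections import defaultdict

    M = 32                                   # unit square cut into M x M cells
    D_MAX = 1.0 / M                          # assumed bound: D(u, n) <= 1/M
    cells = defaultdict(list)                # (i, j) -> [(x, y, node_id), ...]

    def cell_of(x, y):
        return min(int(x * M), M - 1), min(int(y * M), M - 1)

    def add_node(node_id, D):
        """Place a random vertex; return its neighbors' ids. The costly D is
        only called for vertices in the surrounding 3x3 block of cells that
        are already within D_MAX."""
        x, y = random.random(), random.random()
        ci, cj = cell_of(x, y)
        neighbors = []
        for i in range(max(ci - 1, 0), min(ci + 2, M)):
            for j in range(max(cj - 1, 0), min(cj + 2, M)):
                for ux, uy, uid in cells[(i, j)]:
                    dist = math.hypot(x - ux, y - uy)
                    if dist <= D_MAX and dist <= D(uid):
                        neighbors.append(uid)
        cells[(ci, cj)].append((x, y, node_id))
        return neighbors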

How to partition a graph into possibly overlapping parts such that any vertex contained in a part has at least distance k from the boundary?

How do you partition a graph into possibly overlapping parts such that every vertex is contained in at least one part in which it has distance at least k from that part's boundary?
The problem arises in cases where the whole graph cannot be loaded onto a single machine because there is not sufficient memory. So another requirement is that the parts have roughly equal numbers of vertices.
Are there any algorithms that try to minimize the common vertices between parts?
The use case here is this: you want to perform a query starting from an initial vertex that you know will require at most k traversals. Having a part that contains all the vertices of this query results in zero network utilization.
The problem thus is to reduce the memory overhead of such a partition.
Any books I should read?
I found this which looks promising:
http://grafia.cs.ucsb.edu/sedge/docs/sedge-sigmod12-slides.pdf
Final edit: it is no coincidence that Google decided to use a hash partition. Finding a good partition is difficult. I'll go with a hash partition as well and hope that the data center has good network bandwidth.
You can use a breadth-first search to get all the nodes that are at most distance k from the node in question, starting with the node itself. When you reach distance k from the origin, you can end the search.
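A small sketch of that bounded BFS (the adjacency-list input format is an assumption):

    from collections import deque

    def ball_of_radius_k(adj, start, k):
        """Return all vertices within distance k of start, using a BFS that
        stops expanding once depth k is reached. adj: dict vertex -> neighbors."""
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if dist[v] == k:                 # don't expand past radius k
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return set(dist)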
Edit:
Use a depth-first search to assign a minimum-distance-from-boundary property to each node. Once you have completed the depth-first search, a simple iteration through the nodes should provide the partitions. For example, you can create a table whose key is the minimum distance from the boundary and whose value is a vector of nodes, representing a partition.
