LGBM Decision Tree Split - Target Continuous variable - lightgbm

How does LightGBM build a decision tree for a continuous target variable? How is the gain calculated for each column when the target is continuous?
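As a rough illustration only (this is not code from the LightGBM source, and the helper names are made up): the split gain commonly described for gradient-boosted trees such as LightGBM sums the gradients G and hessians H on each side of a candidate split and scores it as G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda). For squared-error loss on a continuous target the gradients are residuals and the hessians are 1, so the gain behaves like a variance-reduction score.

import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    # Gradient-based split gain as described in the GBDT literature
    # (XGBoost/LightGBM papers); g_*/h_* are sums of gradients/hessians.
    def leaf_score(g, h):
        return g * g / (h + lam)
    return (leaf_score(g_left, h_left) + leaf_score(g_right, h_right)
            - leaf_score(g_left + g_right, h_left + h_right))

# Toy example with squared-error loss on a continuous target:
# gradients are (prediction - y); starting from the mean prediction,
# hessians are all 1, so this is essentially variance reduction.
y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])   # a single feature column

pred = np.full_like(y, y.mean())
grad = pred - y
hess = np.ones_like(y)

best = None
for threshold in np.unique(x)[:-1]:
    left = x <= threshold
    gain = split_gain(grad[left].sum(), hess[left].sum(),
                      grad[~left].sum(), hess[~left].sum())
    if best is None or gain > best[1]:
        best = (threshold, gain)

print("best threshold:", best[0], "gain:", best[1])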

Related

Can a conjunctive/disjunctive normal form be represented in a binary tree?

I could find this related question:
Distributing AND over OR in a binary tree (Conjunctive Normal Form)
I'm not quite sure what the outcome of the CNF binary tree representation would be for this expression:
A & B & C
AND
|- A
|- AND
   |- B
   |- C
Is this right? My basic question is: can the CNF binary representation have multiple AND nodes in the tree rather than just one AND node as the root? My understanding is that we can have non-root AND nodes as long as their parent is an AND node.
A related question: is this representation optimal, or is representing them as an n-ary tree with just one root AND node more beneficial? The optimality I'm looking at here is with respect to building and traversing the tree.
// Edit based on the comment.
For the sake of simplicity, assume that the not (~) operator is only part of the leaf nodes A, B or C. That means you don't need to worry about the ~ operator appearing in non-leaf nodes, which could otherwise change the tree structure when expanded according to De Morgan's laws.
A minimum BDD for your conjunction: [BDD diagram not reproduced here; it was created using an online BDD tool.]
For a simple AND with three inputs, there is nothing to optimize. The BDD nodes build a chain.
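A minimal sketch (class names are mine, not from any library) showing that a chain of nested binary AND nodes, as in the question's tree, evaluates exactly like a single n-ary conjunction, so non-root AND nodes whose parent is an AND are perfectly fine:

from itertools import product

class Leaf:
    def __init__(self, name, value):
        self.name, self.value = name, value
    def eval(self):
        return self.value

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self):
        return self.left.eval() and self.right.eval()

# A & B & C as the nested binary tree from the question: AND(A, AND(B, C))
def build(a, b, c):
    return And(Leaf("A", a), And(Leaf("B", b), Leaf("C", c)))

for a, b, c in product([False, True], repeat=3):
    assert build(a, b, c).eval() == (a and b and c)
print("the nested binary ANDs agree with the flat conjunction on all inputs")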

Comparing trees with different leaf sets (different number and label of leaf nodes)

I have hierarchical data from file/folder structures which I use to build trees. I am now trying to compare these trees with random ones and amongst themselves.
To compare against random trees I can preserve the number and labels of the leaf nodes and use traditional tree distance metrics (for instance the Robinson-Foulds distance). However, to compare trees built from different data (with different numbers of leaves and different labels), I have no idea which metric/algorithm to use. Any suggestions?
thanks!
PS - the goal of the comparison would be to establish how similar the topology is between these trees and to see which clusters may exist (and hence add some evidence to the thoughts on the generating mechanisms behind the folder structures).
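As an illustrative aside on the Robinson-Foulds distance mentioned above (not a full answer to the differing-leaf-set problem), here is a minimal pure-Python sketch of a clade-based RF variant for rooted trees given as nested tuples; one common workaround, shown here, is to restrict both trees to their shared leaves before comparing. All helper names are made up.

def leaf_set(tree):
    if isinstance(tree, tuple):
        return frozenset().union(*(leaf_set(c) for c in tree))
    return frozenset([tree])

def clades(tree, restrict):
    # Leaf sets under every internal node, restricted to the shared leaves.
    found = set()
    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(c) for c in node))
        else:
            leaves = frozenset([node])
        kept = leaves & restrict
        if len(kept) > 1:            # ignore trivial single-leaf clades
            found.add(kept)
        return leaves
    walk(tree)
    return found

def robinson_foulds(t1, t2):
    shared = leaf_set(t1) & leaf_set(t2)                   # only shared leaves
    return len(clades(t1, shared) ^ clades(t2, shared))    # symmetric difference

t1 = (("a", "b"), ("c", ("d", "e")))
t2 = (("a", ("b", "c")), ("d", "f"))   # different leaf set: f instead of e
print(robinson_foulds(t1, t2))         # -> 4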

Algorithm for high level planning in artificial intelligence

I am currently working on an artificial intelligence project in which an agent needs to push and pull boxes from their original positions to certain goal positions. The project will then be expanded to include multiple agents, so we have a supervisor that takes care of creating "high-level" goals, while the agents take care of the actual execution.
In practice, for the moment, the supervisor should decide the order in which the boxes are put on their goal positions. In fact, it can happen that putting a box on its goal position blocks the path to another goal.
Our first approach to solving this problem is to consider "cut positions". A position is a cut position if it divides the walkable space into two subsets such that the agent is in one of them and one or more goals are in the other. For example, consider the following level, in which "x" is the agent, "A" and "B" are boxes and "a" and "b" are the respective goal positions:
+++++++++++++++++++++++++++++++++++++++++
+x   a                                 b+
+++++ +++++++++++++++++++++++++++++++++++
  +AB +
  +++++
In this case the position of goal "a" is a cut position, because if a box is put there, then the agent will not be able to reach goal "b".
Can you suggest a fast algorithm to compute the cut positions, ideally one that also returns the number of goals that each cut position blocks?
What you call a cut position for your grid world is called a cut vertex or articulation point in general graphs. From Wikipedia:
Specifically, a cut vertex is any vertex whose removal increases the number of connected components.
And a bit further down in the same article:
The classic sequential algorithm for computing biconnected components in a connected undirected graph due to John Hopcroft and Robert Tarjan (1973) [1] runs in linear time, and is based on depth-first search. This algorithm is also outlined as Problem 22-2 of Introduction to Algorithms (both 2nd and 3rd editions).
Having determined the biconnected components, it should be quite easy to determine the articulation points: All nodes which are contained in more than one bi-connected component are articulation points.
You could model the area as an undirected graph, where each node is a position on the map and two nodes are connected if the positions are adjacent. Then you can mark those 'cut positions' in the graph and see all the paths that would be blocked by a box placed on a cut position.
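A minimal sketch of that idea, assuming NetworkX is acceptable (its articulation_points function implements the DFS-based linear-time approach quoted above); the little level below is made up, not the one from the question:

import networkx as nx

# Hypothetical level: '+' is a wall, everything else is a walkable cell.
level = [
    "+++++++",
    "+     +",
    "+++ +++",
    "+     +",
    "+++++++",
]

# One node per walkable cell, edges between horizontally/vertically adjacent cells.
G = nx.Graph()
for r, row in enumerate(level):
    for c, ch in enumerate(row):
        if ch == '+':
            continue
        G.add_node((r, c))
        for dr, dc in ((0, 1), (1, 0)):
            nr, nc = r + dr, c + dc
            if nr < len(level) and nc < len(level[nr]) and level[nr][nc] != '+':
                G.add_edge((r, c), (nr, nc))

cuts = set(nx.articulation_points(G))
print("cut positions:", sorted(cuts))

# How many cells a box on a cut position would seal off from the agent:
agent = (1, 1)
for cut in sorted(cuts):
    if cut == agent:
        continue
    H = G.copy()
    H.remove_node(cut)
    blocked = sum(len(comp) for comp in nx.connected_components(H)
                  if agent not in comp)
    print(cut, "would block", blocked, "cells from the agent")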

Strategy to build test graphs for Dijkstra's algorithm?

I recently implemented Dijkstra's algorithm to practice Java. I'm now considering how to build random test graphs (with unidirectional edges).
Currently, I use a naive method. Nodes are created at random locations in 2d space (where x and y are unsigned integers between 0 and some MAX_SPACE constant). Edges are randomly created to connect the nodes, so that each node has an outdegree of at least 1 (and at most MAX_DEGREE). Indegree is not enforced. Then I search for a path between the first and last Nodes in the set, which may or may not be connected.
In a more realistic situation, nodes would have a probability of being connected proportional to their proximity in 2d space. What is a good strategy to build random test graphs with that property?
NOTES
I will primarily use this to build graphs that can be drawn and verified by hand, but scaling to larger graphs is a consideration.
The strategy should be easily modified to support the following constants (and maybe others -- let me know if you think of any interesting ones):
MIN_NODES, MAX_NODES: a range of sizes for the graph
CONNECTEDNESS: average out-degree
PROXIMITY: weight given to preferring to connect proximal nodes
You could start by looking at the different random graph generators available in JUNG (Java library):
Barabasi Albert Generator - Simple evolving scale-free random graph generator. At each time step, a new vertex is created and is connected to existing vertices according to the principle of "preferential attachment", whereby vertices with higher degree have a higher probability of being selected for attachment.
Eppstein Power Law Generator - Graph generator that generates undirected graphs with power-law degree distributions.
There are various other generators available too - see the listing here.
For Python there is the NetworkX library, which also provides many graph generators - listed here.
With many of these generators you can specify the size, so you can start small and go from there.
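If you end up on the Python side, here is a minimal NetworkX sketch of the proximity idea (the constant names mirror the question; random_geometric_graph places nodes uniformly in the unit square and connects any pair closer than a given radius):

import math
import random
import networkx as nx

MIN_NODES, MAX_NODES = 8, 12   # range of graph sizes, as in the question
RADIUS = 0.4                   # stands in for PROXIMITY: larger radius -> denser graph

n = random.randint(MIN_NODES, MAX_NODES)
G = nx.random_geometric_graph(n, RADIUS)   # edge iff two nodes are within RADIUS

# Turn it into a weighted digraph for Dijkstra: both directions,
# weight = Euclidean distance between the endpoints.
pos = nx.get_node_attributes(G, "pos")
D = nx.DiGraph()
D.add_nodes_from(G.nodes())                # keep isolated nodes too
for u, v in G.edges():
    w = math.dist(pos[u], pos[v])
    D.add_edge(u, v, weight=w)
    D.add_edge(v, u, weight=w)

try:
    print(nx.dijkstra_path(D, 0, n - 1, weight="weight"))
except nx.NetworkXNoPath:
    print("nodes 0 and", n - 1, "are not connected in this sample")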

How does the make "-j" option actually work?

From the man pages:
-j [jobs], --jobs[=jobs]
    Specifies the number of jobs (commands) to run simultaneously. If there is
    more than one -j option, the last one is effective. If the -j option is
    given without an argument, make will not limit the number of jobs that can
    run simultaneously.
I know it uses a dependency graph to work out which rules are independent.
I would like to know how this graph is built and to understand what criteria are used.
Thanks.
The dependency graph is based, as one would expect, on the prerequisites listed for each Makefile target. make will build a graph where the targets and prerequisites are vertices and there is a directed edge from the prereqs to their targets. In this way the number of incoming edges tells you how many prereqs a target has. If it has no incoming edges, then it has no prerequisites.
The vertices for the .c and .h files, for example, will have no incoming edges. Those files are your source files and do not need to be built.
It then performs a topological sort on the graph to determine the order of execution. From Wikipedia:
The canonical application of topological sorting (topological order) is in scheduling a sequence of jobs or tasks; topological sorting algorithms were first studied in the early 1960s in the context of the PERT technique for scheduling in project management (Jarnagin 1960). The jobs are represented by vertices, and there is an edge from x to y if job x must be completed before job y can be started (for example, when washing clothes, the washing machine must finish before we put the clothes to dry). Then, a topological sort gives an order in which to perform the jobs.
The gist of a topological sort is to find the vertices with no incoming edges (no dependencies) and put those first. Then remove them from the graph. Now you'll have a new set of vertices with no incoming edges (no dependencies). Those are next. And so on until finished. (If you ever reach a point when there are no such vertices then the dependency graph contains a cycle, which is an error condition.)
In a typical Makefile this means you'll first build the source files (nothing needs to be done). Then the object files that depend on those source files. Then the libraries and executables built from those object files.
Under normal non-parallel operation make will simply pick a single target each iteration and build it. When running in parallel, it will grab as many dependency-free targets as it can and build them simultaneously, up to the number of permitted jobs.
So when make gets to, say, the object file step, it will have a large number of vertices in the graph that all have no incoming edges. It knows it can build the object files in parallel and so it forks off n copies of gcc to build the object files.
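A toy sketch of that scheduling idea (Kahn-style topological processing, not GNU make's actual implementation; the file names are made up): each "wave" below is the set of targets whose prerequisites are already done, i.e. what -j could run in parallel.

from collections import defaultdict, deque

# Toy dependency graph in Makefile spirit: target -> list of prerequisites.
rules = {
    "app":    ["main.o", "util.o"],
    "main.o": ["main.c", "util.h"],
    "util.o": ["util.c", "util.h"],
    "main.c": [], "util.c": [], "util.h": [],   # source files: nothing to build
}

indeg = {t: len(p) for t, p in rules.items()}   # unfinished prerequisites per target
dependents = defaultdict(list)                  # reverse edges: prereq -> targets
for t, prereqs in rules.items():
    for p in prereqs:
        dependents[p].append(t)

ready = deque(t for t, d in indeg.items() if d == 0)
while ready:
    wave = list(ready)          # everything here could be built in parallel (-j)
    ready.clear()
    print("build in parallel:", wave)
    for node in wave:
        for t in dependents[node]:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)

if any(d > 0 for d in indeg.values()):
    print("leftover targets -> the dependency graph contains a cycle")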
I suspect you're expecting something more magical than there actually is here. Makefiles contain lines like:
target: prereq1 prereq2 prereq3 ...
This defines a relationship between files in the system; in graph parlance, each white-space-separated word on the line implicitly declares a node in the graph, and a directed edge is created between each of the nodes to the left of the colon and each of the nodes to the right of the colon, pointing from the latter to the former.
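For instance, a hypothetical, drastically simplified parser that only understands lines of that shape and turns them into the edge list just described might look like this (real make syntax, with commands, variables, pattern rules and includes, is far richer):

makefile = """\
app: main.o util.o
main.o: main.c util.h
util.o: util.c util.h
"""

edges = []   # (prerequisite, target): prereq must be finished before target
for line in makefile.splitlines():
    if ":" not in line or line.startswith("\t"):
        continue                      # skip command lines and anything else
    target, _, prereqs = line.partition(":")
    for prereq in prereqs.split():
        edges.append((prereq, target.strip()))

print(edges)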
From there it's a simple matter of traversing the graph to find nodes that have no incoming edges and executing the commands associated with those nodes, then working back 'up' the graph from there.
Hope that helps.
