XGBoost single tree split history plot - CART

See the CART performance-per-node plot
I used another CART package that produces a plot of performance for each added node:
the x-axis is the number of nodes and the y-axis is the log loss for a binary target.
I was just wondering whether xgboost has similar functionality for a single tree, or for a specified tree.
I know xgboost has similar plotting functionality per training epoch; I need it for a single tree as nodes are added or removed.
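As far as I know, xgboost does not expose a per-node loss curve for a single tree; the closest built-ins I'm aware of are `xgb.plot_tree` (needs graphviz) and `Booster.trees_to_dataframe()` for the structure of one tree, plus the per-boosting-round eval metrics. A minimal sketch, where the dataset and parameters are placeholders:

```python
# Sketch, not a direct answer: there is (to my knowledge) no built-in
# per-node loss curve for one tree. The closest built-ins are
# trees_to_dataframe() for a single tree's splits and evals_result for
# the per-boosting-round log loss.
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

evals_result = {}
booster = xgb.train(
    {"objective": "binary:logistic", "eval_metric": "logloss", "max_depth": 4},
    dtrain,
    num_boost_round=20,
    evals=[(dtrain, "train")],
    evals_result=evals_result,
)

# Structure of a single tree (tree 0): one row per node with split feature,
# threshold, gain and cover -- a starting point for any per-node view.
print(booster.trees_to_dataframe().query("Tree == 0"))

# Per-boosting-round (not per-node) log loss, which xgboost does track.
plt.plot(evals_result["train"]["logloss"])
plt.xlabel("boosting round")
plt.ylabel("log loss")
plt.show()
```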

Related

Large graph processing on one machine

What algorithms are there for processing a large graph on an ordinary user machine (roughly 8-16 GB of RAM)?
The task is to process a fairly large graph (to compute PageRank) that does not fit entirely into RAM under these constraints.
I would like to know what algorithms exist for this, or which direction is best to start studying. As I understand it, graph-partitioning algorithms could help here, but it is not clear how, given that the whole graph cannot be built in the program at once.
Perhaps there are algorithms for computing PageRank on each separate part of the graph and then combining the results.
UPD:
To be more concrete: the task is to compute PageRank on a large graph, and the computation is done in a Python program.
The graph is built from the data using networkx, and the PageRank computation would also be done with networkx. The problem is the RAM limit: the entire graph does not fit into memory.
So I wonder whether there are algorithms that would let me compute PageRank on graphs (subgraphs?) smaller than the original one.
Generally speaking, if a graph is too large to fit into memory, it must be split into multiple partitions.
Suppose the vertices fit into memory and the edges reside on disk. The program then loads one partition of the edges into memory, updates the PageRank values, and loads the next partition. X-Stream gives a good solution for this case: http://sigops.org/s/conferences/sosp/2013/papers/p472-roy.pdf
The more complicated case is when neither the vertices nor the edges fit into memory; then both have to be loaded into memory many times. GridGraph gives a good solution for this case: https://www.usenix.org/system/files/conference/atc15/atc15-paper-zhu.pdf
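A minimal sketch of the first case (vertices in RAM, edges streamed from disk), in the spirit of X-Stream. The file name `edges.tsv`, its layout, the vertex count `N` and the chunk size are all assumptions for illustration, and dangling-node handling is omitted:

```python
# PageRank with vertex state in RAM and edges streamed from disk in chunks.
# "edges.tsv" holds one "src\tdst" pair of integer vertex ids per line.
import numpy as np
import pandas as pd

N = 1_000_000          # number of vertices (assumed known up front)
d = 0.85               # damping factor
rank = np.full(N, 1.0 / N)

# Out-degrees need one pass over the edge file.
out_deg = np.zeros(N, dtype=np.int64)
for chunk in pd.read_csv("edges.tsv", sep="\t", names=["src", "dst"],
                         chunksize=5_000_000):
    out_deg += np.bincount(chunk["src"].to_numpy(), minlength=N)

for _ in range(20):    # fixed number of power iterations
    new_rank = np.full(N, (1.0 - d) / N)
    for chunk in pd.read_csv("edges.tsv", sep="\t", names=["src", "dst"],
                             chunksize=5_000_000):
        src = chunk["src"].to_numpy()
        dst = chunk["dst"].to_numpy()
        # Each edge pushes a share of its source's rank to its destination.
        np.add.at(new_rank, dst, d * rank[src] / out_deg[src])
    rank = new_rank
```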

How to improve Dagre js calculation performance

I'm using Dagre to generate graph coordinates on the frontend for a graph of around 700 nodes and 700 edges, and it currently takes around 1.5 to 2 seconds (this is before rendering). How should I go about optimising this? Are there any known ways to speed it up?
For example, I already know the graph is directed, acyclic and topologically sorted (validated in the API), so could I somehow skip that part of the algorithm?
Another approach could be to reduce the size of the graph in the first place by 'clustering' closed groups as per the diagram below (which could then be expanded on click in the UI); a sketch of that idea follows. Are there any known algorithms to achieve this?
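One hedged sketch of the graph-reduction idea, illustrated in Python with networkx rather than on the frontend: contract linear chains (nodes with exactly one predecessor and one successor) into their predecessor before handing the smaller graph to Dagre. The chain criterion and the `collapsed` attribute are assumptions, not anything Dagre provides:

```python
# Contract linear chains into super-nodes so the layout engine sees a
# smaller graph; the UI can expand a super-node (its "collapsed" list)
# on click. networkx is used only to illustrate the idea.
import networkx as nx

def contract_chains(g: nx.DiGraph) -> nx.DiGraph:
    reduced = g.copy()
    for node in list(g.nodes):
        if reduced.in_degree(node) == 1 and reduced.out_degree(node) == 1:
            (pred,) = reduced.predecessors(node)
            (succ,) = reduced.successors(node)
            if pred != succ:
                # Bypass the chain node and remember what was collapsed.
                reduced.add_edge(pred, succ)
                collapsed = reduced.nodes[pred].get("collapsed", [])
                reduced.nodes[pred]["collapsed"] = collapsed + [node]
                reduced.remove_node(node)
    return reduced

g = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
print(contract_chains(g).edges)   # chain b-c folded away, leaving a -> d
```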

Clustering elements based on highest similarity

I'm working with Docker images, which consist of a set of reusable layers. Given a collection of images, I would like to group together images that share a large number of layers.
To be more exact: given a collection of N images, I want to create clusters in which all images share more than X percent of their layers with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of clustering algorithms where a similarity measure decides which images belong in a cluster together. I know how to write the similarity measure; however, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code, please?
EDIT: after some more searching, I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ), but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in separate clusters.
I take it you are a developer with no experience in data science?
There are a number of clustering algorithms, each with its own advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than one might think.
I assume N is small enough that you can store a matrix of N^2 float values in RAM? If so, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just compute it for all N^2 pairs and store the results in a matrix (the matrix is symmetric, so only half of it needs to be stored). Make sure your similarity measure assigns a special value, such as 0 or infinity, to pairs of images whose similarity is below some threshold X% (which value depends on whether you treat the function as a similarity measure or as a distance). I think the cleanest solution is to assign 1 to pairs whose similarity exceeds the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, for example, a depth-first search or any other graph-traversal routine; the vertices you reach form your first cluster. Then take the first unvisited vertex and repeat the traversal. Of course, you can store the graph as an adjacency list to save memory.
This algorithm assumes that you do not care exactly how similar the images are or which pairs are more similar than others, only whether they are similar enough (similarity above the given threshold); a small sketch of the whole procedure follows.
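A minimal Python sketch of this thresholded-graph approach, where `similarity(a, b)` stands in for the measure the asker already knows how to write and the Jaccard example at the bottom is purely hypothetical:

```python
# Threshold the pairwise similarities, treat the result as a graph, and
# take connected components (found by DFS/BFS) as clusters.
from itertools import combinations
import networkx as nx

def cluster_images(images, similarity, threshold):
    g = nx.Graph()
    g.add_nodes_from(range(len(images)))          # every image is a vertex
    for i, j in combinations(range(len(images)), 2):
        if similarity(images[i], images[j]) > threshold:
            g.add_edge(i, j)                      # "similar enough" edge
    # Each connected component is one cluster; isolated vertices become
    # singleton clusters.
    return [sorted(component) for component in nx.connected_components(g)]

# Hypothetical usage with layer sets as images and Jaccard similarity:
layers = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"x"}]
jaccard = lambda s, t: len(s & t) / len(s | t)
print(cluster_images(layers, jaccard, threshold=0.5))
```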
Unfortunately, in cluster analysis it is common that all possible pairs have to be computed. You can save some distance calls by using fancy data structures for k-nearest-neighbor search, but you have to make sure your similarity measure satisfies the triangle inequality.
If you are not satisfied with this answer, please give more details of your problem and read about:
K-means (main disadvantage: you have to specify the number of clusters)
Hierarchical clustering (slow; at the top of the dendrogram all images are in one cluster, so you have to cut the dendrogram at the proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this simple problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram from top to bottom until I find a cluster where the distance is below a threshold; a sketch with SciPy is below. In the worst case there is no such cluster, but then I end up in a leaf of the dendrogram, which means that element is in a cluster of its own.
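For reference, a minimal sketch of that dendrogram-plus-threshold approach using SciPy's `linkage`/`fcluster`. The 4x4 distance matrix is made up; in practice it would come from your own similarity measure (e.g. distance = 1 - similarity):

```python
# Build the dendrogram and cut it at a distance threshold, so elements that
# never get close enough end up in singleton clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

distance_matrix = np.array([        # hypothetical 4x4 pairwise distances
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])
condensed = squareform(distance_matrix)               # SciPy wants condensed form
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=0.3, criterion="distance")  # cut at distance 0.3
print(labels)                                         # e.g. [1 1 2 2]
```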

Applying K-means clustering on Z-score Normalized Data

I've been working to understand how to apply k-means clustering to a small set of data for a list of companies.
The mean and standard deviation are given so that I can determine the normalized data.
For example, I have the following:
From my understanding of k-means clustering, I have to pick the centroids randomly, with k = 3, and then keep adjusting the centroid locations until no more movement occurs, that is, the assignments stay the same from one iteration to the next.
I am having difficulty applying these steps to my data set. I've watched and searched for many step-by-step examples of how to do this, but I haven't found one that I could follow.
Basically, what I am supposed to do is show a scatter plot at each adjustment of the centroids.
I believe I have to calculate the distance between two data items using the Euclidean distance formula, but does that mean the distance between the z-score of sales and the z-score of fuel, or what? This is where I am lost, even after reading about a dozen PowerPoints and watching multiple videos.
This seems to be the best example I've come across, but even then I'm still a bit lost, because my example is slightly different from the one introduced: http://www.indiana.edu/~dll/Q530/Q530_kk.pdf
The most progress I've made was coming across a variety of data mining software, such as WEKA, Orange, various Excel add-ons such as XLMiner, etc. However, they seem to provide the end result, not the procedures required to get there.
Any help is appreciated. If more information is needed, please let me know.
Thank you.
Edit: I've found some more solutions and thought I should add them in case anyone runs into the same issues.
1) I calculated the Euclidean distance using the Excel formula mentioned in this video: http://www.lynda.com/Excel-tutorials/Calculating-distance-centroid/165438/175003-4.html
This is what the formula looks like: =SQRT((B28-$B$52)^2+(C28-$C$52)^2), keeping in mind that each cell reference points to where your data is stored.
In this case my cells are listed in the image here: http://i.imgur.com/W44km64.png
This has given me the following table: http://i.imgur.com/miTiVj5.png
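The Excel formula above is just the 2D Euclidean distance between a company's (z-score sales, z-score fuel) point and a centroid; a tiny Python equivalent with made-up numbers:

```python
# Python equivalent of =SQRT((B28-$B$52)^2+(C28-$C$52)^2): the 2D Euclidean
# distance between a data point and a centroid. The numbers are placeholders.
import math

def euclidean(point, centroid):
    return math.sqrt((point[0] - centroid[0]) ** 2 +
                     (point[1] - centroid[1]) ** 2)

company = (0.45, -1.20)    # (z-score sales, z-score fuel cost)
centroid = (-0.10, 0.30)
print(euclidean(company, centroid))
```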
You are right on with the process. Personally, I'd view your data as 2D, using just the (x, y) pair of Sales and Fuel Cost... though you could use all 4 features and simply have 4D points instead.
Step 1: Either pick random centers (3 of them: c_1, c_2, c_3) or split your data into 3 random clusters. If you randomly split the data into 3 clusters, compute the mean of all the points in each cluster; those 3 means become the three centers. (Here by mean I mean the average of each coordinate: think of the points as vectors and average the vectors.)
Step 2: Each center represents one of the three clusters. For each point, compute the distance to each center (this could be Euclidean distance, or any other distance metric). Each point is moved into the cluster whose center is closest, i.e. if point i is closest to center j, then regardless of which cluster point i was in, it moves to cluster j. Keep track of whether any point moves to a new cluster; this is used as part of your stopping condition in Step 3.
Step 3: After all the points have moved to the cluster nearest them, recompute the centers by averaging all the points in each cluster. Then go back to Step 2 and repeat until no points change cluster. A NumPy sketch of the whole loop follows.
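A minimal NumPy sketch of Steps 1-3 on 2D points; the data, k = 3 and the random seed are placeholders:

```python
# Assign points to the nearest center, recompute the centers, and stop when
# no point changes cluster.
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((30, 2))        # pretend z-scored (sales, fuel) pairs
k = 3

# Step 1: pick k random data points as the initial centers.
centers = points[rng.choice(len(points), size=k, replace=False)]
labels = np.full(len(points), -1)

while True:
    # Step 2: assign every point to its nearest center (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):   # stopping condition from Step 3
        break
    labels = new_labels
    # Step 3: recompute each center as the mean of its cluster's points
    # (keep the old center if a cluster happens to become empty).
    centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print(labels)
print(centers)
```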

Reduce the data points on a graph?

I have a huge graph with thousands of data points.
Plotting the graph as is creates a mess with too many lines.
Question: what's the best way to reduce the number of data points?
Example: say my graph has 1000 data points and I need to bring this down to 100.
I tried:
a) taking groups of 10 data points and creating one point from their average. This produced terrible results and the graph looked like something else entirely.
b) taking the 1st of every 10 data points. This was better than (a), but the graph was still noticeably different.
There is the Douglas-Peucker algorithm for simplifying curves: it removes points while preserving the overall shape of the curve; a minimal sketch is below.
(Note that the remaining points will be distributed somewhat unevenly.)
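A minimal recursive Ramer-Douglas-Peucker sketch in Python/NumPy. Note there is no direct "keep exactly 100 points" knob: you tune epsilon until roughly the desired number of points remain.

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: drop points whose perpendicular distance to the
    start-end chord is at most epsilon, recursing on the worst offender."""
    points = np.asarray(points, dtype=float)
    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.hypot(chord[0], chord[1])
    if chord_len == 0:                      # degenerate chord: plain distance
        dists = np.linalg.norm(points - start, axis=1)
    else:                                   # perpendicular distance to the chord
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / chord_len
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Keep the most deviant point and simplify both halves around it.
        left = rdp(points[: idx + 1], epsilon)
        right = rdp(points[idx:], epsilon)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

xs = np.linspace(0, 10, 1000)
curve = np.column_stack([xs, np.sin(xs)])
print(len(rdp(curve, epsilon=0.01)))        # far fewer than 1000 points
```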

Resources