When tuning hyperparameters I see that the RMSE gets larger with a greater number of iterations. This is the exact opposite of what I was expecting. Could it be that the data is too noisy for sequentially learned trees? My data set is huge, with a lot of very small and some very large numbers, so I don't think posting a representative sample would be helpful/possible. I am just wondering: what is the likely cause of the trend with iteration count that we see in the plots?
The y-axis values decrease with height, so the RMSE is actually going down as the number of iterations grows. The graph is as expected.
See the figure below, from Advanced Computer Architecture by Hwang, which discusses the scalability of performance in parallel processing.
The questions are:
1- Regarding figure (a), what are examples of theta (exponential) and alpha (constant)? Which workloads grow exponentially as the number of machines increases? Also, I haven't seen a constant workload when working with multiple cores/computers.
2- Regarding figure (b), why is the efficiency of exponential workloads the highest? I cannot understand that!
3- Regarding figure (c), what does the fixed-memory model mean? A fixed-memory workload sounds like alpha, which is labelled the fixed-load model.
4- Regarding figure (c), what does the fixed-time model mean? The term "fixed" is misleading, I think; I interpret it as "constant". The text says that the fixed-time model is actually the linear curve, gamma, in (a).
5- Regarding figure (c), why doesn't the exponential (memory-bound) model hit the communication bound?
You may see the text describing the figure below.
I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!!"
Can someone shed some light on that with some examples?
Workload refers to the input size or problem size, which is basically the amount of data to be processed. Machine size is the number of processors. Efficiency is defined as speedup divided by the machine size. The efficiency metric is more meaningful than speedup (1). To see this, consider for example a program that achieves a speedup of 2X on a parallel computer. This may sound impressive. But if I also told you that the parallel computer has 1000 processors, a 2X speedup is really terrible. Efficiency, on the other hand, captures both the speedup and the context in which it was achieved (the number of processors used). In this example, efficiency is equal to 2/1000 = 0.002. Note that efficiency ranges between 1 (best) and 1/N (worst). If I just tell you that the efficiency is 0.002, you'd immediately realize that it's terrible, even if I don't tell you the number of processors.
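As a minimal sketch of that formula in code (the numbers are just the ones from the example above):

```python
# Minimal illustration of the efficiency formula: efficiency = speedup / N,
# where N is the number of processors.

def efficiency(speedup, num_processors):
    """Efficiency ranges from 1/N (worst) to 1 (best)."""
    return speedup / num_processors

# A 2X speedup sounds good in isolation...
print(efficiency(2, 4))     # 0.5   -- decent on 4 processors
print(efficiency(2, 1000))  # 0.002 -- terrible on 1000 processors
```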
Figure (a) shows different kinds of applications whose workloads can change in different ways to utilize a specific number of processors. That is, the applications scale differently. Generally, the reason you add more processors is to be able to exploit the increasing amount of parallelism available in larger workloads. The alpha line represents an application with a fixed-size workload, i.e., the amount of parallelism is fixed, so adding more processors will not give any additional speedup. If the speedup is fixed but N gets larger, then the efficiency decreases and its curve would look like that of 1/N. Such an application has zero scalability.
The other three curves represent applications that can maintain high efficiency with an increasing number of processors (i.e., scalable) by increasing the workload in some pattern. The gamma curve represents the ideal workload growth. This is defined as the growth that maintains high efficiency but in a realistic way. That is, it does not put too much pressure on other parts of the system such as memory, disk, inter-processor communication, or I/O, so scalability is achievable. Figure (b) shows the efficiency curve of gamma. The efficiency slightly deteriorates due to the overhead of higher parallelism and due to the serial part of the application, whose execution time does not change. This represents a perfectly scalable application: we can realistically make use of more processors by increasing the workload. The beta curve represents an application that is somewhat scalable, i.e., good speedups can be attained by increasing the workload but the efficiency deteriorates a little faster.
The theta curve represents an application where very high efficiency can be achieved because there is so much data that can be processed in parallel. But that efficiency can only be achieved theoretically. That's because the workload has to grow exponentially, and realistically all of that data cannot be efficiently handled by the memory system. So such an application is considered to be poorly scalable despite the theoretically very high efficiency.
Typically, applications with sub-linear workload growth end up being communication-bound when increasing the number of processors, while applications with super-linear workload growth end up being memory-bound. This is intuitive. Applications that process very large amounts of data (the theta curve) spend most of their time processing the data independently with little communication. On the other hand, applications that process moderate amounts of data (the beta curve) tend to have more communication between the processors, where each processor uses a small amount of data to calculate something and then shares it with others for further processing. The alpha application is also communication-bound because if you use too many processors to process the fixed amount of data, then the communication overhead will be too high since each processor will operate on a tiny data set. The fixed-time model is called so because it scales very well (it takes about the same amount of time to process more data with more processors available).
I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!!"
How do you reach the minimum execution time? Increase the number of processors as long as the speedup is increasing. Once the speedup reaches a fixed value, you've reached the number of processors that achieves the minimum execution time. However, efficiency might be very poor if the speedup is small. This follows naturally from the efficiency formula. For example, suppose that an algorithm achieves a speedup of 3X on a 100-processor system and increasing the number of processors further will not increase the speedup. Therefore, the minimum execution time is achieved with 100 processors. But the efficiency is merely 3/100 = 0.03.
Example: Parallel Binary Search
A serial binary search has an execution time equal to log2(N) where N is the number of elements in the array to be searched. This can be parallelized by partitioning the array into P partitions where P is the number of processors. Each processor then will perform a serial binary search on its partition. At the end, all partial results can be combined in serial fashion. So the execution time of the parallel search is (log2(N)/P) + (C*P). The latter term represents the overhead and the serial part that combines the partial results. It's linear in P and C is just some constant. So the speedup is:
log2(N)/((log2(N)/P) + (C*P))
and the efficiency is just that divided by P. By how much should the workload (the size of the array) increase to maintain maximum efficiency (or to make the speedup as close to P as possible)? Consider for example what happens when we increase the input size linearly with respect to P. That is:
N = K*P, where K is some constant. The speedup is then:
log2(K*P)/((log2(K*P)/P) + (C*P))
How does the speedup (or efficiency) change as P approaches infinity? Note that the numerator has a logarithm term, but the denominator has a logarithm plus a polynomial of degree 1. The polynomial grows much faster than the logarithm, so the denominator grows much faster than the numerator and the speedup (and hence the efficiency) approaches zero rapidly. It's clear that we can do better by increasing the workload at a faster rate. In particular, we have to increase it exponentially. Assume that the input size is of the form:
N = K^P, where K is some constant. The speedup is then:
log2(K^P)/((log2(K^P)/P) + (C*P))
= P*log2(K)/((P*log2(K)/P) + (C*P))
= P*log2(K)/(log2(K) + (C*P))
This is a little better now. Both the numerator and the denominator grow linearly, so the speedup is basically a constant. This is still bad because the efficiency would be that constant divided by P, which drops steeply as P increases (it would look like the alpha curve in Figure (b)). It should be clear now that the input size should be of the form:
N = K^(P^2), where K is some constant. The speedup is then:
log2(K^(P^2))/((log2(K^(P^2))/P) + (C*P))
= P^2*log2(K)/((P^2*log2(K)/P) + (C*P))
= P^2*log2(K)/((P*log2(K)) + (C*P))
= P^2*log2(K)/((C+log2(K))*P)
= P*log2(K)/(C+log2(K))
Ideally, the term log2(K)/(C+log2(K)) should be one, but that's impossible since C is not zero. However, we can make it arbitrarily close to one by making K arbitrarily large. So K has to be very large compared to C. This makes the input size even larger, but does not change it asymptotically. Note that both of these constants have to be determined experimentally and they are specific to a particular implementation and platform. This is an example of the theta curve.
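Here is a rough sketch that plugs the three workload-growth models into the speedup expression above (K and C are arbitrary illustrative constants, not measured values):

```python
import math

# Evaluate the speedup expression from the derivation above,
#   speedup = log2(N) / (log2(N)/P + C*P),
# under the three workload-growth models. K and C are arbitrary
# illustrative constants; in practice they would be measured.

C = 0.01
K = 1000.0

def speedup(log2_n, p):
    return log2_n / (log2_n / p + C * p)

print(f"{'P':>6} {'N=K*P':>10} {'N=K^P':>10} {'N=K^(P^2)':>12}")
for p in [10, 100, 1000, 10000]:
    linear  = math.log2(K) + math.log2(p)  # log2(N) for N = K*P      -> speedup tends to 0
    expo    = p * math.log2(K)             # log2(N) for N = K^P      -> speedup tends to log2(K)/C
    expo_sq = p * p * math.log2(K)         # log2(N) for N = K^(P^2)  -> speedup grows ~linearly in P
    print(f"{p:>6} {speedup(linear, p):>10.3f} {speedup(expo, p):>10.3f} {speedup(expo_sq, p):>12.1f}")
```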
(1) Recall that speedup = (execution time on a uniprocessor)/(execution time on N processors). The minimum speedup is 1 and the maximum speedup is N.
The logistic map is a classic example where floating point numbers fail. It's also a great example of how error propagates very badly in numerical algorithms in general, even when dealing with bignums. I was wondering if there are any known algorithms for taming this issue? Is there an efficient way to compute a logistic map that doesn't require naively computing it with huge precision?
It is a classic example because it is a chaotic system. The entire point of a chaotic system is that it shows unbelievable sensitivity to initial conditions. Getting an answer within 5% of the correct value after n iterations requires starting with O(n) digits of the initial number. Not because your algorithm is bad, but because changing any of those digits changes what the answer should be.
So, no. While you can potentially speed up the calculation somewhat, you can't get away with starting with lower precision.
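To see the sensitivity concretely, here is a small sketch that iterates the map at two precisions (r = 4 and the precision settings are arbitrary choices):

```python
from decimal import Decimal, getcontext

# Iterate the logistic map x -> r*x*(1-x) at two different precisions and
# watch the trajectories diverge. r = 4 (fully chaotic regime) and the
# precision settings are arbitrary choices for the demonstration.

def logistic(x0, r, steps):
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

# 53-bit doubles vs. 100-digit decimals, same nominal starting point.
getcontext().prec = 100
x_float = logistic(0.1, 4.0, 60)
x_exact = logistic(Decimal("0.1"), Decimal(4), 60)

print(x_float)          # double-precision result
print(float(x_exact))   # high-precision result, rounded for display
# After ~60 iterations the two values bear no resemblance to each other:
# roughly one bit of accuracy is lost per iteration, so O(n) digits of
# precision are needed to get n iterations right.
```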
I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast at floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board, where the DRAM-to-chip rate is somewhat limited and the board-to-PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O, which takes (lots of) seconds, kills the demo.
So I am looking for some kind of calculation with higher complexity than n^3 (the more the better), preferably easy to program and to explain/understand, to make the computation part more dominant in the time budget, with the data set preferably bounded to about 16KB per thread (core).
Any suggestions?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
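A rough sketch of the second suggestion, assuming a classic RK4 integrator and arbitrary units/constants:

```python
import numpy as np

# Rough sketch of a gravitational n-body demo integrated with classic RK4.
# Units, particle count, masses and the softening term are arbitrary choices;
# the point is lots of floating-point work (O(n^2) per step) on little data.

N = 256                      # particles
G = 1.0                      # gravitational constant in arbitrary units
SOFTENING = 1e-3             # avoids singularities when particles get close

rng = np.random.default_rng(0)
pos = rng.standard_normal((N, 3))
vel = rng.standard_normal((N, 3)) * 0.1
mass = np.ones(N)

def acceleration(pos):
    # Pairwise displacement vectors r_ij = p_j - p_i, shape (N, N, 3).
    diff = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]
    dist2 = (diff ** 2).sum(axis=2) + SOFTENING
    inv_d3 = dist2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)          # no self-interaction
    return G * (diff * (mass[np.newaxis, :, None] * inv_d3[:, :, None])).sum(axis=1)

def rk4_step(pos, vel, dt):
    # Classic fourth-order Runge-Kutta on the combined (pos, vel) state.
    k1v = acceleration(pos);               k1x = vel
    k2v = acceleration(pos + 0.5*dt*k1x);  k2x = vel + 0.5*dt*k1v
    k3v = acceleration(pos + 0.5*dt*k2x);  k3x = vel + 0.5*dt*k2v
    k4v = acceleration(pos + dt*k3x);      k4x = vel + dt*k3v
    new_pos = pos + dt/6.0 * (k1x + 2*k2x + 2*k3x + k4x)
    new_vel = vel + dt/6.0 * (k1v + 2*k2v + 2*k3v + k4v)
    return new_pos, new_vel

for step in range(100):
    pos, vel = rk4_step(pos, vel, 0.01)

print(pos.mean(axis=0))  # just to show it ran
```

The state for 256 particles is only about 12 KB, while each RK4 step costs four O(n^2) force evaluations, so the compute-to-I/O ratio is easy to crank up by raising N or the step count.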
What you can do is:
Declare a std::vector of int.
Populate it with 0 to N-1 in sorted (ascending) order.
Now keep using std::next_permutation repeatedly until the sequence is sorted again, i.e., next_permutation returns false.
With N integers this will need O(N!) calculations, and it is also deterministic.
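For what it's worth, an equivalent sketch in Python using itertools.permutations instead of std::next_permutation (N is an arbitrary small value):

```python
import itertools
import math

# Equivalent of the C++ std::next_permutation loop described above, using
# itertools.permutations: visit every permutation of 0..N-1 and do a tiny
# amount of work per permutation. Running time grows as O(N!) and the result
# is fully deterministic.

N = 10
total = 0
for perm in itertools.permutations(range(N)):
    total += perm[0]          # trivial work so the loop isn't optimised away

# Sanity check: each first element appears (N-1)! times, so the sum is known.
assert total == (N - 1) * math.factorial(N) // 2
print(math.factorial(N), "permutations visited")
```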
PageRank may be a good fit. Articulated as a linear algebra problem, you repeatedly square a floating-point matrix of controllable size until convergence. In the graphical metaphor, you "ripple" the change arriving at each node out along its outgoing edges. Both treatments can be made parallel.
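A rough sketch of the linear-algebra formulation, assuming the usual damping-factor ("Google matrix") construction on a random graph; the sizes and constants are arbitrary:

```python
import numpy as np

# PageRank as repeated squaring of the (row-stochastic) Google matrix.
# For a positive stochastic matrix G, the rows of G^(2^k) all converge to the
# stationary distribution, i.e. the PageRank vector. Graph size, density and
# the damping factor are arbitrary demo choices.

n = 500
damping = 0.85
rng = np.random.default_rng(0)

# Random adjacency matrix; make sure every node has at least one out-link.
A = (rng.random((n, n)) < 0.01).astype(float)
A[np.arange(n), rng.integers(0, n, n)] = 1.0

# Row-normalise, then mix in the teleportation term.
P = A / A.sum(axis=1, keepdims=True)
G = damping * P + (1.0 - damping) / n

# Repeatedly square until the rows stop changing; each squaring doubles the
# effective number of power-iteration steps, so convergence is very fast.
for _ in range(20):
    G_next = G @ G
    if np.abs(G_next - G).max() < 1e-12:
        break
    G = G_next

pagerank = G[0]                      # any row is (approximately) the PageRank vector
print(pagerank.sum(), pagerank[:5])  # sums to ~1.0
```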
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
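A rough sketch of one such approximate algorithm (random restarts plus concentration steps, in the spirit of FAST-LTS); the polynomial, subset size h and noise levels are arbitrary demo choices:

```python
import numpy as np

# Approximate least trimmed squares (LTS) polynomial fit via random restarts
# plus concentration steps: fit on a subset, keep the h points with the
# smallest residuals, refit, and repeat. All constants below (degree, h,
# noise levels, restart count) are arbitrary demo choices.

rng = np.random.default_rng(1)
n, degree = 200, 3
h = int(0.7 * n)                       # size of the "clean" subset to fit

x = np.linspace(-1, 1, n)
y = 2 * x**3 - x + 0.05 * rng.standard_normal(n)
outliers = rng.choice(n, size=30, replace=False)
y[outliers] += rng.standard_normal(30) * 5.0     # large noise on some samples

def trimmed_cost(coeffs):
    res2 = (np.polyval(coeffs, x) - y) ** 2
    return np.sort(res2)[:h].sum()

best_cost, best_coeffs = np.inf, None
for _ in range(200):                               # random restarts -> lots of fits
    subset = rng.choice(n, size=20, replace=False)
    coeffs = np.polyfit(x[subset], y[subset], degree)
    for _ in range(10):                            # concentration steps
        res2 = (np.polyval(coeffs, x) - y) ** 2
        keep = np.argsort(res2)[:h]
        coeffs = np.polyfit(x[keep], y[keep], degree)
    cost = trimmed_cost(coeffs)
    if cost < best_cost:
        best_cost, best_coeffs = cost, coeffs

print("trimmed SSQ:", best_cost)
print("coefficients:", best_coeffs)    # should be close to [2, 0, -1, 0]
```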
Well, one way to go would be to implement a brute-force solver for the Traveling Salesman Problem in some M-space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their Pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should be neither too hard nor too intensive to parallelize across the cores, though it does involve some overhead. (I can demonstrate, if needed.)
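For illustration, a rough single-threaded sketch of that brute force (M, N and the coordinates are arbitrary; for multiple cores you would hand each core a disjoint block of permutations):

```python
import itertools
import math
import random

# Brute-force TSP in M-dimensional space, as described above: try every
# permutation of the points, sum the (N-1) pairwise Pythagorean distances
# along the path, and keep the shortest. No memoisation, no pruning. M, N and
# the random coordinates are arbitrary demo choices; the work grows
# factorially with N while the data stays at O(N*M) numbers.

M, N = 3, 8
random.seed(0)
points = [[random.random() for _ in range(M)] for _ in range(N)]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

best_len, best_path = float("inf"), None
for perm in itertools.permutations(range(N)):
    length = sum(dist(points[perm[i]], points[perm[i + 1]]) for i in range(N - 1))
    if length < best_len:
        best_len, best_path = length, perm

print(best_len, best_path)
```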
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop, you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You could usefully use up to O(gridcells) processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia and Lyapunov come to mind as potential candidates, but any should do.
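For instance, a rough escape-time sketch for the Mandelbrot set (grid size, window and iteration cut-off are arbitrary choices):

```python
import numpy as np

# Escape-time computation of the Mandelbrot set on a grid. Every grid cell is
# completely independent, so this parallelises trivially across cores. Grid
# size, complex-plane window and the iteration cut-off are arbitrary choices.

WIDTH, HEIGHT, MAX_ITER = 600, 400, 300

re = np.linspace(-2.0, 1.0, WIDTH)
im = np.linspace(-1.2, 1.2, HEIGHT)
c = re[np.newaxis, :] + 1j * im[:, np.newaxis]

z = np.zeros_like(c)
escape = np.full(c.shape, MAX_ITER, dtype=np.int32)  # cut-off for points that never escape
alive = np.ones(c.shape, dtype=bool)                 # points still being iterated

for i in range(MAX_ITER):
    z[alive] = z[alive] ** 2 + c[alive]
    escaped_now = alive & (np.abs(z) > 2.0)
    escape[escaped_now] = i
    alive &= ~escaped_now

print(escape.min(), escape.max())  # the escape counts form the fractal "map"
```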
So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2 if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it first directly, and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbor, but rather a good approximation of it (for example the 4th NN to your query, while you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You can read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the nearest neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
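To make the idea concrete, a toy random-hyperplane LSH sketch for cosine similarity, independent of those libraries (the number of bits, tables and the random data are arbitrary):

```python
import numpy as np

# Minimal random-hyperplane LSH for cosine similarity (a toy sketch, separate
# from FALCONN/DOLPHINN): each hash bit is the sign of a dot product with a
# random hyperplane, so points pointing in similar directions tend to land in
# the same bucket. The number of bits, tables and the data are arbitrary.

rng = np.random.default_rng(0)
n, dim, n_bits, n_tables = 16000, 75, 12, 8
data = rng.standard_normal((n, dim)).astype(np.float32)

def hash_keys(points, planes):
    bits = (points @ planes.T) > 0                 # sign pattern per point
    return (bits.astype(np.uint32) << np.arange(n_bits, dtype=np.uint32)).sum(axis=1)

tables = []
for _ in range(n_tables):
    planes = rng.standard_normal((n_bits, dim)).astype(np.float32)
    buckets = {}
    for idx, key in enumerate(hash_keys(data, planes)):
        buckets.setdefault(int(key), []).append(idx)
    tables.append((planes, buckets))

def query(q, k=2):
    # Collect candidates from the query's bucket in every table, then rank
    # only those candidates exactly by cosine similarity.
    candidates = set()
    for planes, buckets in tables:
        key = int(hash_keys(q[np.newaxis, :], planes)[0])
        candidates.update(buckets.get(key, []))
    cand = np.fromiter(candidates, dtype=np.int64)
    sims = (data[cand] @ q) / (np.linalg.norm(data[cand], axis=1) * np.linalg.norm(q))
    return cand[np.argsort(-sims)[:k]]

print(query(data[0], k=3))  # index 0 itself plus whichever near neighbours collided
```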
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (a different sort for EVERY observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the array of nearest-neighbour distances that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect partial sorting, for example numpy.argpartition in the Python NumPy library, to select only the closest K values you are interested in. There is no need to sort the entire array.
The cost of Grembo's answer above can be reduced significantly, as you only need the K nearest values, and there is no need to sort all the distances from each point.
If you just need the K neighbours, this method will work very well, reducing your computational cost and time complexity.
If you need the K neighbours sorted, sort the output again afterwards.
See the documentation for argpartition.
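A small sketch of that approach with NumPy (the random data matches the question's 16,000 x 75 shape; the chunk size is an arbitrary choice to bound memory):

```python
import numpy as np

# Brute-force k-NN using numpy.argpartition, as suggested above: compute the
# distances from each point to all the others, then select (rather than fully
# sort) the k smallest. Processing in chunks keeps the distance matrix from
# blowing up memory; the chunk size and random data are arbitrary choices.

rng = np.random.default_rng(0)
data = rng.standard_normal((16000, 75)).astype(np.float32)
k = 2
chunk_size = 1000

sq_norms = (data ** 2).sum(axis=1)
neighbours = np.empty((data.shape[0], k), dtype=np.int64)

for start in range(0, data.shape[0], chunk_size):
    chunk = data[start:start + chunk_size]
    rows = np.arange(start, start + chunk.shape[0])
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = sq_norms[rows][:, None] - 2.0 * chunk @ data.T + sq_norms[None, :]
    d2[np.arange(chunk.shape[0]), rows] = np.inf       # exclude each point itself
    # argpartition moves the k smallest distances to the front, unsorted.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    # Sort just those k columns in case ordered neighbours are needed.
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    neighbours[rows] = np.take_along_axis(idx, order, axis=1)

print(neighbours[:5])
```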
Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or did it allow a lower number of k-means trials and hence a much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
For the evaluation metric, I am using a reciprocal of a simplified version of the Dunn index, simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is used so that the index works in the correct direction (i.e. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d from the straight line joining the first and last points of the curve, as in the figure shown:
Note: I haven't read the paper you linked to yet.
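For reference, a small sketch of that maximum-distance heuristic as I understand it (the point on the curve farthest from the straight line joining its first and last points); the evaluation-metric curve below is synthetic:

```python
import numpy as np

# Knee detection by maximum distance to the chord: draw the straight line
# between the first and last points of the evaluation-metric curve and pick
# the point farthest from it. The curve below is synthetic, just to show the
# mechanics; in practice x would be the candidate cluster counts and y the
# evaluation metric.

x = np.arange(1, 21, dtype=float)
y = 1.0 / x + 0.02 * x          # L-shaped curve with a knee near the start

# Unit vector along the chord from the first point to the last point.
p0 = np.array([x[0], y[0]])
p1 = np.array([x[-1], y[-1]])
chord = (p1 - p0) / np.linalg.norm(p1 - p0)

# Perpendicular distance of every point to the chord (2-D cross product).
pts = np.column_stack([x, y]) - p0
dist = np.abs(pts[:, 0] * chord[1] - pts[:, 1] * chord[0])

knee = int(x[np.argmax(dist)])
print("knee at k =", knee)
```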