LightGBM selection of target for regression (GPU) - lightgbm

I compiled/installed the GPU variation of LightGBM and I am able to run the regression example.
Now I would like to use another dataset for regression.
How do I specify which column is the target and which ones are the predictors?

Related

Machine learning algorithm for correlation between indicators

I have a dataset with several indicators related to some geographical entities. I want to study the factors that influence an indicator A (among the other indicators), and I need to determine which indicators affect it the most (correlation).
Which ML algorithm should I use?
I want to have a kind of scoring function for my indicator A to allow its prediction.
What you are looking for are correlation coefficients. You have multiple choices; the most common are:
Pearson's coefficient, which only measures the linear relationship between two variables (see SciPy's implementation)
Spearman's coefficient, which can also capture monotonic non-linear relationships (see SciPy's implementation)
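To illustrate the difference between the two coefficients, here is a minimal pure-Python sketch. The helper names (`pearson`, `ranks`, `spearman`) and the toy data are assumptions made for this example; in practice the SciPy implementations mentioned above would normally be used instead.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data: y grows monotonically with x, but not linearly.
x = list(range(1, 11))
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

On this toy data Spearman's coefficient is at its maximum because the relationship is perfectly monotonic, while Pearson's coefficient is lower because the relationship is not a straight line.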
You can also normalize your data using z-normalization and then fit a simple linear regression. The regression coefficients can give you an idea of the influence of each variable on the outcome. However, this method is highly sensitive to multicollinearity, which might be present, especially if your variables are geographical.
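As a rough sketch of the z-normalization approach, the example below standardizes two made-up predictors and a target, then computes the standardized regression coefficients with the closed-form solution for exactly two predictors. The data and the variable names are assumptions for illustration only; for more predictors a regression library would be the usual choice.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale to mean 0 and standard deviation 1 (population convention)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def corr(za, zb):
    """Correlation of two lists that are already z-scored."""
    return sum(p * q for p, q in zip(za, zb)) / len(za)

# Hypothetical indicators; y is built as an exact linear combination.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2 * a + 0.5 * b for a, b in zip(x1, x2)]

z1, z2, zy = zscore(x1), zscore(x2), zscore(y)
r12, r1y, r2y = corr(z1, z2), corr(z1, zy), corr(z2, zy)

# Least-squares solution for two standardized predictors.
# The denominator shrinks toward 0 as x1 and x2 become collinear,
# which is why the coefficients become unstable in that case.
beta1 = (r1y - r12 * r2y) / (1 - r12 ** 2)
beta2 = (r2y - r12 * r1y) / (1 - r12 ** 2)

print(f"standardized coefficient of x1: {beta1:.3f}")
print(f"standardized coefficient of x2: {beta2:.3f}")
```

Because both predictors are on the same scale after z-normalization, the two coefficients can be compared directly; the term `1 - r12 ** 2` in the denominator shows where multicollinearity causes trouble.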
Could you provide an example of the dataset? Discrete or continuous variables? Which software are you using?
Anyway, an easy way to test correlation (without going into ML algorithms in the strict sense) is to simply compute Pearson's or Spearman's correlation coefficient on selected features, or on the whole dataset by creating a matrix of the data. You can do that in Python with NumPy (see this) or in R (see this).
You can also use simple linear regression or logistic/multinomial logistic regression (depending on the nature of your data) to quantify the influence of the other features on your target variable. Just keep in mind that "correlation is not causation". Look here to see some models.
Then it depends on the object of your analysis whether to aggregate all the features of all the geographical points or create covariance matrices for each "subset" of observation related to the geographical points.

LightGBM: Intent of lightgbm.dataset()

What is the purpose of lightgbm.Dataset() as per the docs when I can use the sklearn API to feed the data and train a model?
Are there any real-world examples that explain the usage of lightgbm.Dataset()?
LightGBM uses a few techniques to speed up training which require preprocessing one time before training starts.
The most important of these is bucketing continuous features into histograms. When LightGBM searches splits to possibly add to a tree, it only searches the boundaries of these histogram bins. This greatly reduces the number of splits to evaluate.
I think this picture from "What Makes LightGBM Fast?" describes it well:
The Dataset object in the library is where this preprocessing happens. Histograms are created one time, and then don't need to be calculated again for the rest of training.
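The idea of binning once and then searching only bin boundaries can be sketched in pure Python. This is not LightGBM's actual implementation: the function names `build_bins` and `best_split`, the equal-frequency binning rule, and the squared-error gain are all assumptions chosen to keep the example small.

```python
from bisect import bisect_left

def build_bins(values, max_bin):
    """One-time preprocessing: pick at most `max_bin` upper bounds."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bin:
        return distinct  # every distinct value gets its own bin
    step = len(distinct) / max_bin
    return [distinct[int((i + 1) * step) - 1] for i in range(max_bin)]

def best_split(values, targets, bounds):
    """Search split thresholds only at bin boundaries."""
    n_bins = len(bounds)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    # Build the histogram: per-bin target sum and row count.
    for v, t in zip(values, targets):
        b = bisect_left(bounds, v)
        sums[b] += t
        counts[b] += 1

    total_sum, total_n = sum(sums), sum(counts)
    best_gain, best_threshold = float("-inf"), None
    left_sum, left_n = 0.0, 0
    for b in range(n_bins - 1):  # rows with value <= bounds[b] go left
        left_sum += sums[b]
        left_n += counts[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_sum = total_sum - left_sum
        # Reduction in squared error from splitting at this boundary.
        gain = (left_sum ** 2 / left_n + right_sum ** 2 / right_n
                - total_sum ** 2 / total_n)
        if gain > best_gain:
            best_gain, best_threshold = gain, bounds[b]
    return best_threshold, best_gain

# Toy feature with 100 distinct values and a step in the target at 50.
values = list(range(100))
targets = [0.0 if v < 50 else 1.0 for v in values]

bounds = build_bins(values, max_bin=4)  # computed once, reused afterwards
threshold, gain = best_split(values, targets, bounds)
print("bin upper bounds:", bounds)
print("best threshold:", threshold, "gain:", gain)
```

With 4 bins only 3 candidate thresholds are evaluated instead of 99, and `bounds` can be reused for every split search during training, which is the kind of saving the one-time preprocessing is meant to provide.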
You can get some more information about what happens in the Dataset object by looking at the parameters that control that Dataset, available at https://lightgbm.readthedocs.io/en/latest/Parameters.html#dataset-parameters. Some examples of other tasks:
optimization for sparse features
filtering out features that are not splittable
"when I can use the sklearn API to feed the data and train a model"
The lightgbm.sklearn interface is intended to make it easy to use LightGBM alongside other libraries like xgboost and scikit-learn. It takes in data in formats like scipy sparse matrices, pandas data frames, and numpy arrays to be compatible with those other libraries. Internally, LightGBM constructs a Dataset from those inputs.

How to implement Breusch-Godfrey test for a regression with ARIMA errors in R

I’m fitting a regression with ARIMA errors with the fable package and, as mentioned in my previous question, the Breusch-Godfrey test is not available there.
The regression part of the model has two pairs of Fourier terms to account for yearly seasonality and several exogenous regressors. The residuals are modeled with a seasonal ARIMA(2,0,0)(1,0,0)[7] model. My goal is to check for autocorrelation in residuals.
I can use the Ljung-Box test, but according to this thread and the textbook sources cited there, it is not valid in the presence of lags of the dependent variable.
And I’m afraid I will lose my model specification if I use different packages/libraries. An alternative might be to use Arima from the forecast package, which retains the model specification, and then use bgtest from the lmtest package. But I can’t figure out how to do this.
According to this R forum, the Breusch-Godfrey test for an ARIMA model can be done by fitting a simple regression of the residuals from the fitted model on a constant and then running bgtest. But that only concerns a simple AR(1) model with no exogenous regressors.
Is this the right way to do it? I’m concerned that for the BG test you have to perform an auxiliary regression on the regressors and the lagged residuals up to order p. How does bgtest know the X variables in this case, since they are not stored in the residuals object, which should be a simple vector?

Will non-linear regression algorithms perform better if trained with normally distributed target values?

After finding out about the many transformations that can be applied to the target values (the y column) of a data set, such as Box-Cox transformations, I learned that linear regression models need to be trained with normally distributed target values in order to be efficient (https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va).
I'd like to know if the same applies to non-linear regression algorithms. So far I've seen people on Kaggle use a log transformation to mitigate heteroskedasticity when using xgboost, but they never mention whether it is also done to get normally distributed target values.
I've tried to do some research, and I found in Andrew Ng's lecture notes (http://cs229.stanford.edu/notes/cs229-notes1.pdf), on page 11, that the least squares cost function, used by many algorithms both linear and non-linear, is derived by assuming a normal distribution of the error. I believe that if the error should be normally distributed, then the target values should be as well.
If this is true, then all the regression algorithms using the least squares cost function should work better with normally distributed target values.
Since xgboost uses the least squares cost function for node splitting (http://cilvr.cs.nyu.edu/diglib/lsml/lecture03-trees-boosting.pdf - slide 13), maybe this algorithm would work better if I transform the target values using Box-Cox transformations for training the model and then apply the inverse Box-Cox transformation to the output in order to get the predicted values.
Will this theoretically speaking give better results?
Your conjecture "I believe if the error should be normally distributed then the target values should be as well." is totally wrong. So your question does not have any answer at all since it is not a valid question.
There are no assumptions on the target variable to be Normal at all.
Getting the target variable transformed does not mean the errors are normally distributed. In fact, that may ruin normality.
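The distinction between a normal target and normal errors can be checked on simulated data. The sketch below is only an illustration under assumed settings (an exponentially distributed predictor, Gaussian noise, a fixed seed, and skewness as a crude normality indicator); none of these choices come from the question.

```python
import random
from statistics import mean

def skewness(values):
    """Sample skewness; close to 0 for symmetric (e.g. normal) data."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)
n = 5000
# Skewed predictor, normal errors: the target inherits the skew of x.
x = [random.expovariate(1.0) for _ in range(n)]
y = [3.0 * xi + random.gauss(0.0, 0.5) for xi in x]

# Ordinary least squares fit with one predictor.
mx, my = mean(x), mean(y)
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"skewness of target y:  {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

Here the target is clearly non-normal while the residuals are symmetric, so the least squares assumption about the errors holds even though nothing was done to the distribution of the target.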
I have no idea what this is supposed to mean: "linear regression models need to be trained with normally distributed target values in order to be efficient." Efficient in what way?
Linear regression models are global models. They simply fit a surface to the overall data. The operations are matrix operations, so the time to "train" the model depends only on the size of data. The distribution of the target has nothing to do with model building performance. And, it has nothing to do with model scoring performance either.
Because targets are generally not normally distributed, I would certainly hope that such a distribution is not required for a machine learning algorithm to work effectively.

Neural Network with correlated features

Is there a neural network algorithm that supports adding features on the fly (a non-fixed feature set) and that does not assume the features are uncorrelated with each other?
I don't think you can add features on the fly, because NNs, like many other algorithms, work with input vectors of a fixed size, even if those vectors are sparse. You can train with one feature set, store the weights, add the new features, and start a new training run. I think it will converge much faster than the first one.
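One way to reuse stored weights when a feature is added is to widen the first layer and give the new input a weight of zero, so the network initially behaves exactly as before and retraining starts from the old solution. The sketch below assumes a single linear layer stored as plain lists; the helper names `forward` and `add_input_feature`, and the zero initialization itself, are illustrative assumptions rather than something stated in the answer.

```python
def forward(weights, bias, inputs):
    """One linear unit per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def add_input_feature(weights, init=0.0):
    """Return a widened copy with one extra input weight per unit."""
    return [row + [init] for row in weights]

# Hypothetical weights stored after training on two features.
trained_weights = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.3]
old_input = [1.0, 2.0]

before = forward(trained_weights, bias, old_input)

# A third feature arrives: widen the layer, keep the old weights.
wider_weights = add_input_feature(trained_weights)
after = forward(wider_weights, bias, old_input + [7.0])

print("before:", before)
print("after: ", after)
```

The outputs match until further training updates the new weights, which is consistent with the expectation that the second training run starts from a good point.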
A (first-order) NN works like logistic regression and solves an optimization problem for a global maximum. There are no assumptions about the features at all; it just finds a function, related to a probability distribution, that maximizes the likelihood of the training data. This is unlike Naive Bayes, where each probability is calculated separately and the probabilities are then combined under an independence assumption.
