LightGBM: Intent of lightgbm.dataset() - lightgbm

What is the purpose of lightgbm.Dataset() as per the docs when I can use the sklearn API to feed the data and train a model?
Any real world examples explaining the usage of lightgbm.dataset() would be interesting to learn?

LightGBM uses a few techniques to speed up training which require preprocessing one time before training starts.
The most important of these is bucketing continuous features into histograms. When LightGBM searches splits to possibly add to a tree, it only searches the boundaries of these histogram bins. This greatly reduces the number of splits to evaluate.
I think this picture from "What Makes LightGBM Fast?" describes it well:
The Dataset object in the library is where this preprocessing happens. Histograms are created one time, and then don't need to be calculated again for the rest of training.
You can get some more information about what happens in the Dataset object by looking at the parameters that control that Dataset, available at https://lightgbm.readthedocs.io/en/latest/Parameters.html#dataset-parameters. Some examples of other tasks:
optimization for sparse features
filtering out features that are not splittable
when I can use the sklearn API to feed the data and train a model
The lightgbm.sklearn interface is intended to make it easy to use LightGBM alongside other libraries like xgboost and scikit-learn. It takes in data in formats like scipy sparse matrices, pandas data frames, and numpy arrays to be compatible with those other libraries. Internally, LightGBM constructs a Dataset from those inputs.

Related

Clarifying statsmodels AutoReg(), ARMA() and SARIMAX() for time-series forecasting

I am buidling my first time-series prediction model with scikit-learn's LinearRegression(). I also came across statsmodels AutoReg(), ARMA() and SARIMAX(). Unfortunately out of the literature I could not figure out to consider them. Are they alternatives to LinearRegression()? Are they ML? Are they fundamental different?
I'd appreciate a hint, where to look further. Thanks.
All three fit variants of Seasonal Autoregressive Integrated Moving Average with eXogenous Variables (SARIMAX) models.
AutoReg
AutoReg is limited to only Autoregressive Models and so does not include Seasonal or Moving Average components. It does support exogenous regressors. It also supports complex deterministic processes such as Fourier series to model multiple seasonalities. Parameters are estimated using OLS which is equivalent to conditional maximum likelihood. Since parameters are estimated using OLS, estimation is very fast and completely deterministic.
ARIMA
ARIMA is a restricted version of SARIMAX that does not include Seasonal components or Exogenous regressors. Because it excludes these two types of terms, it can offer additional fitting options that are not available when fitting a full SARIMAX model. These have different statistical properties than the Maximum Likelihood method that is the only method available in SARIMAX (ARIMA also supports Maximum Likelihood). Many of these alternative parameter estimation methods are also faster than ML.
SARIMAX
SARIMAX supports all features of ARIMA plus the two additional components. It can only be estimated using Maximum Likelihood. ML uses numerical methods to maximize the function and so estimation of some series/models may encounter difficulties converging.
The examples page is the best place to look to see the detailed use of these models. Many of the notebooks include both code examples and LaTeX markup that explains the underlying math.

How to deal with multiple data files using lightGBM

I am trying to use lightGBM as a classifier. My data are saved in multiple csv files, but I find there is no way to directly use multiple files as the input.
I have considered to combine all the data into a big one (numpy array), but my computer doesn't have enough memory. How can I use lightGBM to deal with multiple data files when the avaliable memory is poor?
I guess that you are using Python.
What is the size of your data ? (num of rows x num of columns)
Lightgbm will need to load the data in-memory for training.
But if you haven't done it yet, you can wisely choose a suitable datatype for every column of your data.
It can considerably reduce the memory footprint if you use dtypes such as 'uint8' / 'uint16' and help you load everything in memory.
Sample.
You shouldn't ever (Except certain edge cases) need to use your entire dataset if you sample CORRECTLY.
I use a DB that has over 230M records but I usually only select a RANDOM sample of anywhere from 1k-100k to create the model.
Also, you might as well split your data into training, testing and validation. That will help cut down the size per file.
You might want to categorize your features, then to one-hot-encode them. LightGBM works best with sparse features such as one-hot-encoded ones due to its EFB (Effective Feature Bundling) which enhances computation efficiency of LightGBM significantly. Moreover, you will definitely get rid of the floating parts of the numbers.
Think categorization as that; let’s say that values of one of the numerical features vary between 36 to 56, you can digitize it as [36,36.5,37,....,55.5,56] or [40,45,50,55] to make it categorical. Up to your expertise and imagination. You can refer to scikit-learn for one-hot-encoding, it has built-in function for that.
PS: With a numerical feature, always inspect the statistical properties of it, you can use pandas.describe() which summarizes its mean, max, min, std etc.

Artificial neural network image transformation

I have a pairs of images (input-output) but I don't know the transformation to going from A (input) to B (output). I want to record image A and get image B. Physically I can change the setup to get A or B, but I want to do it by software.
If I understood well, a trained Artificial Neural Network is able to do that, having an input can give the corresponding output, is it right?
Is there any software/ANN that just "training" it with entering a number of input-output pairs will be able to provide the correct output if the input is a new (but similar to the others) image?
Thanks
If you have some relevant amount of image pairs (input/output pair) and you don't know transformation between input and output you could train ANN on that training set to imitate that unknown transformation. You will be able to well train your ANN only if you have sufficient amount of training image pairs, but it could be pretty impossible when that unknown transformation is complicated.
For example if that transformation simply increases intensity values of pixels at input image by given value, ANN will very fast learn to imitate that behavior, but if that unknown transformation is some complicated convolution or few serial convolutions or something more complicated it will be very hard, near impossible to train ANN to imitate that transformation. So, more complex transformation will need bigger training set and more complex ANN design.
There are plenty of free opensource ANN libraries implemented in many languages. You could start for example with that tutorial: http://www.codeproject.com/Articles/13091/Artificial-Neural-Networks-made-easy-with-the-FANN
What you are asking is possible in principle -- in theory, an ANN with sufficiently many hidden units can learn an arbitrary function to map inputs to outputs. However, as the comments and other answers have mentioned, there may be many technical issues with your particular problem that could make it impractical. I would classify these problems as (a) mapping complexity, (b) model complexity, (c) scaling complexity, and (d) implementation complexity. They are all somewhat related, but hopefully this is a useful way to break things down.
Mapping complexity
As mentioned by Springfield762, there are many possible functions that map from one image to another image. If the relationship between your input images and your output images is relatively simple -- like increasing the intensity of each pixel by a constant amount -- then an ANN would be able to learn this mapping without much difficulty. There are probably many more transformations that would be similarly easy to learn, such as skewing, flipping, rotating, or translating an image -- basically any affine transformation would be easy to learn. Other, nonlinear transformations could also be feasible, such as squaring the intensity of each pixel.
As a general rule, the more complicated the relationship between your input and output images, the more difficult it will be to get a model to learn this mapping for you.
Model complexity
The more complex the mapping from inputs to outputs, the more complex your ANN model will be to be able to capture this mapping. Models with many hidden layers have been shown in the past 10 years to perform quite well on tasks that people had previously thought impossible, but often these state-of-the-art models have millions or even billions of parameters and take weeks to train on GPU hardware. A simple model can capture many simple mappings, but if you have a complex input-output map to learn, you'll need a large, complex model.
Scaling complexity
Yves mentioned in the comments that it can be difficult to scale models up to typical image sizes. If your images are relatively small (currently the state of the art is to model images on the order of 100x100 pixels), then you can probably just throw a bunch of raw pixel data at an ANN model and see what happens. But if you're using 6000x4000 images from your shiny Nikon DSLR, it's going to be quite difficult to process those in a reasonable amount of time. You'd be better off compressing your image data somehow (PCA is a common technique) and then trying to learn the mapping in the compressed space.
In addition, larger images will have a larger space of possible mappings between them, so you'll need more of your larger images as training data than you would if you had small images.
Springfield762 also mentioned this: If the mapping between your input and output images is simple, then you'll only need a few examples to learn the mapping successfully. But if you have a complicated mapping, then you'll need much more training data to have a chance at learning the mapping properly.
Implementation complexity
It's unlikely that a tool already exists that would let you just throw image data into an ANN model and have a mapping appear. Most likely you'll need, at a minimum, to implement some code that will pre-process your image data. In addition, if you have lots of large images you'll probably need to write code to handle loading data from disk, etc. (There are a lot of "big data" tools for things like this, but they all require some amount of work to get set up.)
There are many, many open source ANN toolkits out there nowadays. FANN (already mentioned) is a popular one in C++ with bindings in other languages. Caffe is quite popular, and is also implemented in C++ with bindings. There seem to be many toolkits that use Python and Theano or some other GPU acceleration library -- Keras, Lasagne, Hebel, Pylearn2, neon, and Theanets (I wrote this one). Many people use Torch, written in Lua. Matlab has at least one neural network toolbox. I'm less familiar with other ecosystems, but Java seems to have Deeplearning4j, C# has Accord, and even R has darch.
But with any of these neural network toolkits, you're going to have to write some code to load the data, process it into the appropriate input format, construct (or load) a network model, train the model, etc.
The problem you're trying to solve is a canonical classification problem that neural networks can help you solve. You treat the B images as a set of labels that you match to A, and once trained, the neural network will be able to match the B images to new input based on where the network locates new input in a high-dimensional vector space. I assume you'd use some combination of convolutional networks to create your features, and softmax for multinomial classification on the output layer. More here: http://deeplearning4j.org/convolutionalnets.html
Since this has been written there has been a lot of work in the realm of cgans ( conditional generative adversarial networks ) please refer to:
https://arxiv.org/pdf/1611.07004.pdf

Classifying Multivariate Time Series

I currently am working on a time series witch 430 attributes and approx. 80k instances. Now I would like to binary classify each instance (not the whole ts). Everything I found about classifying TS talked about labeling the whole thing.
Is it possible to classify each instance with something like a SVM completely disregarding the sequential nature of the data or would that only result in a really bad classifier?
Which other options are there which classify each instance but still look at the data as a time series?
If the data is labeled, you may have luck by concatenating attributes together, so each instance becomes a single long time series, and by applying the so-called Shapelet Transform. This would result in a vector of values for each of time series which can be fed into SVM, Random Forest, or any other classifier. It could be that picking a right shapelets will allow you to focus on a single attribute when classifying instances.
If it is not labeled, you may try the unsupervised shapelets application first to explore your data and proceed with aforementioned shapelet transform after.
It certainly depends on the data within the 430 attributes,
data types, and especially the problem you want to solve.
In time series analysis, you usually want to exploit the dependencies between the neighboring points, i.e., how they change in time. The examples you may find in books usually talk about a single function f(t): Time -> Real. If I understand it correctly, you want to focus just on the dependencies among the 430 attributes (vertical dependencies) and disregard the horizontal dependencies.
If I were you, I would first try to train multiple classifiers (SVM, Maximum entropy model, Multi-layer perceptron, Random forest, Probabilistic Neural Network, ...) and compare their prediction performance in the frame of your problem.
For training, you can start by feeding all 430 attributes as features to Maxent classifier (can easily handle millions of features).
You also need to perform some N-fold cross-validation to see whether the classifiers are not overfitted. Then pick the best that solves your problem "good enough".
Other ideas if this approach does not perform well:
include features from t-1, t-2...
perform feature selection by trying different subsets of features
derive new time series such as moving averages, wavelet spectrum ... and use them as new features
A nice implementation of Maxent classifier can be found in openNLP.

Exporting Mahout model output as Weka input

I would like to use the output model of a Mahout decision tree training process as the input model for a Weka based classifier.
As the training of a complex decision tree that is based on millions of training records is almost impractical for a single node Weka classifier, I would like to use Mahout to build the model, using for example Random Forest Partial Implementation.
While the algorithm above can be problematic while training, it is rather simple to use it for prediction with Weka on a single machine.
On Mahout wiki site it is stated that the data formats for import include Weka ARFF format, but not for export.
Is it possible to use some of the existing implementations in Mahout to train models that will be used in production with a simple Weka based system?
I don't think it's possible to do what you're asking: .arff is a data format, as are all of the other options in the import/export menus. The classifiers that Weka can save/load are, in fact, Weka's java Classifier objects written to a file using Java's Serializable interface. They're not so much portable trees as they are Java objects that last longer than the JVMs which create them. Thus, to do what you want, either Mahout or Weka would have to be able to produce/read each other's code, and that's not something I can find any documentation of.
My experience is that with several million training records (consisting of ~45 numeric features/columns each), Weka's Random Forest implementation using the default options is very fast (operating in seconds on a single 2.26GHz core), so it may not be necessary to bother with Mahout. Your data set may well have different results, though.

Resources