I am following a course on udemy about data science with python.
The course is focused on the output of the algorithm and less on the algorithm by itself.
In particular I am performing a decision tree. Every doing I run the algorithm on python, also with the same samples, the algorithm gives me a slightly different decision tree. I have asked to the tutors and they told me "The decision trees does not guarantee the same results each run because of its nature." Someone can explain me why more in detail or maybe give me an advice for a good book about it?
I did the decision tree of my data importing:
import numpy as np
import pandas as pd
from sklearn import tree
and doing this command:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,y)
where X are my feature data and y is my target data
Thank you
The DecisionTreeClassifier() function is apparently documented here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
So this function has many arguments. But in Python, function arguments may have default values. Here, all arguments have default values, so you can even call the function with an empty argument list, like this:
clf = tree.DecisionTreeClassifier()
The parameter of interest, random_state is documented like this:
random_state: int, RandomState instance or None, default=None
So your call is equivalent to, among many other things:
clf = tree.DecisionTreeClassifier(random_state=None)
The None value tells the library that you don't want to bother with providing a seed (that is, an initial state) to the underlying pseudo-random number generator. Hence, the library has to come up with some seed.
Typically, it will take the current time value, with microsecond precision if possible, and apply some hash function. So at every call you will get a different initial state, and so a different sequence of pseudo-random numbers. Hence, a different tree.
You might want to try forcing the seed. For example:
clf = tree.DecisionTreeClassifier(random_state=42)
and see if your problem persists.
Now, regarding why does the decision tree require pseudo-random numbers, this is discussed for example here:
According to scikit-learn’s “best” and “random” implementation [4], both the “best” splitter and the “random” splitter uses Fisher-Yates-based algorithm to compute a permutation of the features array.
The Fisher-Yates algorithm is the most common way to compute a random permutation. Also, if stopped before completion, it can be used to extract a random subset of the data sample, for example if you need a random 10% of the sample to be excluded from the data fitting and set aside for a later cross-validation step.
Side note: in some circumstances, non-reproducibility can become a pain point, for example if you want to study the influence of an external parameter, say some global Y values bias. In that case, you don't want uncontrolled changes in the random numbers to blur the effects of your parameter changes. Hence the need for the API to provide some way to control the seed value.
Related
I am trying to use hmmlearn's GaussianHMM to fit a Hidden Markov Model with 2 main states, while allowing for multiple exogenous variables. My goal is to determine two states of GDP growth (one with low variance and the other with high variance), these states then depend on lagged unemployment, lagged commercial confidence level etc. I have a couple of questions:
Using hmmlearn's GaussiansHMM, I have read through the documentation but I cannot find any mention of exogenous variable. Using the method fit(X, lengths=None), I see that X can have n_features columns, do I understand correctly that I should pass in an array with the first column being the endogenous varible (GDP growth in my case) and the rest of columns are the exogenous variables ?
Is hmmlearn's GaussianHMM equivalent to statsmodels.tsa.regime_switching.markov_regression.MarkovRegression ? This model allows for exog_tvtp which means that exogenous variables are used to calculate a time varying transition probabilities matrix.
An example of fitting the monthly returns of the S&P500, no exogenous variable.
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
import yfinance as yf
sp500 = yf.download("^GSPC")["Adj Close"]
# Fitting an absolute return model because we only care about volatility #
rets = np.log(sp500/sp500.shift(1)).dropna()
rets.index = pd.to_datetime(rets.index)
rets = rets.resample("M").sum()
model = GaussianHMM(n_components=2)
model.fit(rets.to_frame())
state_sequence = model.predict(rets.to_frame())
Imagine if I want to add a dependency on exogenous variables to the returns of the S&P500, for example on economic growth or past volatilities, is there a way to do this ?
Thanks for any help.
n_features can be thought of as the temporal domain, and should not be conflated with features that describe the complexity of ie. a regression model.
If your hidden states are the two states of GDP growth, then the observed variable (or emissions) that you are trying to infer the hidden states from should be the feature space (a.k.a. n_features).
This should be a single measurement (emission) descriptive of a combination of your "exogenous variables", collected over time. hmmlearn will not be able to take multivariate emissions.
Suggestions
If I understand your question correctly, perhaps what you might be looking for are Kalman filters. KF produces estimates of unknowns based on multiple measurements (ie. all of your exogenous variables) that ultimately produce a model more accurate than those based on a single measurement.
If you wish each hidden state to have multiple independent emissions then what you might be looking for is a structured perceptron. This is discussed here: Hidden Markov Model for multiple observed variables
I want to construct a LightGBM Dataset object from very large X and y, which can not be load to memory. Is there any method that can construct Dataset in "batch"? eg. something like
import lightgbm as lgb
ds = lgb.Dataset()
for X, y in data_generator():
ds.add_new_data(data=X, label=y)
regarding the data there are a few hacks, for example, if your data has numeric, you make sure the precision are too long, e.g. probably two digits would be enough (it depends on your data). or if you have categorical data make sure you store them with digits. but probably you are looking for a better approach
There is a concept called incremental learning. Basically you make a model (a tree) in your first iteration using the first batch of data. Then for your next model, you use that tree as a template and only updates the values (you can also allow for shrinkage). you can use the keep_training_booster for such scenario and please read on your own to learn the mechanism.
The third technique is you make multiple models: say you divide your data into N pieces and make N models, then use an ensemble approach. This way you have used your entire data with N number of observations.
I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using their 'expert knowledge', but I want to automate this process of clusterization.
Most algorithms for clusterization use distance (Euclidean, Mahalanobdis and so on) between objects to group them in clusters. But it is hard to find some reasonable metrics for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. It is reasonable for categorical variables, and it works fine with numeric variables assuming they are distributed normally.
So it became clear to me that algorithms like K-means will not produce good results.
At this time I try to work with COBWEB algorithm, that fully matches my ideas of using conditional probabilities. But I faced another obsacles: results of clusterization are really hard to interpret, if not impossible. As a result I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like descision trees for classification.
So, my question is:
Is there any existing clusterization algorithm that works with mixed data type and produces an understandable (and reasonable for humans) description of clusters.
Additional info:
As I understand my task is in the field of conceptual clustering. I can't define a similarity function as it was suggested (it as an ultimate goal of the whoal project), because of the field of study - it is very complicated and mercyless in terms of formalization. As far as I understand the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it, so I can get an undestandable description of clusters.
Decision Tree
As it was suggested, I tried to train a decision tree on the clustering output, thus getting a description of clusters as a set of rules. But unfortunately interpretation of this rules is almost as hard as with the raw clustering output. First of only a few first levels of rules from the root node do make any sense: closer to the leaf - less sense we have. Secondly, these rules doesn't match any expert knowledge.
So, I came to the conclusion that clustering is a black-box, and it worth not trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: istead of calculating an intra-group variance calcualte a category utility function and use it as a split criterion. As a result we should have a decision tree with leafs-clusters and clusters description out of the box. But I haven't tried to do so, and I am not sure about accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions, say you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorial similarity function.
Once you have a decent similarity function, a whole bunch of algorithms will become available to you. e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you, they have a Tutorial on writing custom distance functions.
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering. Because the one really nice feature about decision trees, is that they are somewhat human understandable. But just like a Support Vector Machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
There was a related publication last year. It is a bit obscure and experimental (on a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to e.g. say "this cluster found is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (Workshops are for work-in-progress type of research). You might be able to use this, by just using your features as ground truth. It should then detect if a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solve this type of clustering problem is to define a statistical model that captures relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian Mixture Model) then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):
There are a number of frameworks that are available to run monte carlo inference for statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGs and is more efficient for many problems:
data {
int<lower=0> N; //number of data points
int<lower=0> C; //number of categories
real x[N]; // normally distributed component data
int y[N]; // categorical component data
}
parameters {
real<lower=0,upper=1> theta; // mixture probability
real mu[2]; // means for the normal component
simplex[C] phi[2]; // categorical distributions for the categorical component
}
transformed parameters {
real log_theta;
real log_one_minus_theta;
vector[C] log_phi[2];
vector[C] alpha;
log_theta <- log(theta);
log_one_minus_theta <- log(1.0 - theta);
for( c in 1:C)
alpha[c] <- .5;
for( k in 1:2)
for( c in 1:C)
log_phi[k,c] <- log(phi[k,c]);
}
model {
theta ~ uniform(0,1); // equivalently, ~ beta(1,1);
for (k in 1:2){
mu[k] ~ normal(0,10);
phi[k] ~ dirichlet(alpha);
}
for (n in 1:N) {
lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
}
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with higher probability to recover the cluster assignments below:
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogenous, non-Euclidean data vectors as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.
I applied the KNN algorithm in matlab for classifying handwritten digits. the digits are in vector format initially 8*8, and stretched to form a vector 1*64. So each time I am comparing the first digit with all the rest data set, (which is quite huge), then the second one with the rest of the set etc etc etc. Now my question is, isn't 1 neighbor the best choice always? Since I am using Euclidean Distance, (I pick the one that is closer) why should I also choose 2 or 3 more neighbors since I got the closest digit?
Thanks
You have to take noise into consideration. Assume that maybe some of your classified examples were classified wrongly, or maybe one of them is oddly very close to other examples - which are different, but it is actually only a "glitch". In these cases - classifying according to this off the track example could lead to a mistake.
From personal experience, usually the best results are achieved for k=3/5/7, but it is instance dependent.
If you want to achieve best performance - you should use cross validation top chose the optimal k for your specific instance.
Also, it is common to use only odd number as k for KNN, to avoid "draws"
A simple program to demonstrate ML Knn algorithm
Knn algorithm works by training a computer with set of data and passing inputs to get expected output. Ex:- consider a parent want to train his child to identify picture of "Rabbit", here parent will show n number of photos of a a Rabbit and if the photo belongs to Rabbit then we shout Rabbit else we will move on , like this in this approach a supervision is taken care to the computer by feeding set of data to get expected output
from sklearn.neigbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
df=pd.read_csv("D:\\heart.csv")
new_data{"data":np.array(df[["age","gende","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal"]],ndmin=2),"target":np.array(df["target"]),"target_names":np.array(["No_problem","Problem"])}
X_train,X_test,Y_train,Y_test=train_test_split(new_data["data"],new_data["target"],random_state=0)
kn=KNeighborsClassifier(n_neighbors=3)
kn.fit(X_train,Y_train)
x_new=np.array([[71,0,0,112,149,0,1,125,0,1.6,1,0,2]])
res=kn.predict(x_new)
print("The predicted k value is : {}\n".format(res))
print("The predicted names is : {}\n".format(new_data["target_names"][res])
print("Score is : {:.2f}".format(kn.score(X_train,Y_train)))
Say I have 100 records, and I want to mock out the created_at date so it fits on some curve. Is there a library to do that, or what formula could I use? I think this is along the same track:
Generate Random Numbers with Probabilistic Distribution
I don't know much about how they are classified in mathematics, but I'm looking at things like:
bell curve
logarithmic (typical biology/evolution) curve?
...
Just looking for some formulas in code so I can say this:
Given 100 records, a timespan of 1.week, and an interval of 12.hours
set created_at for each record such that it fits, roughly, to curve
Thanks so much!
Update
I found this forum post about ruby algorithms, which led me to rsruby, an R/Ruby bridge, but that seems like too much.
Update 2
I wrote this little snippet trying out the gsl library, getting there...
Generate test data in Rails where created_at falls along a Statistical Distribution
I recently came across croupier, a ruby gem that aims to generate numbers according to a variety of statistical distributions.
I have yet to try it but it sounds quite promising.
You can generate UNIX timestamps which are really just integers. First figure out when you want to start, for example now:
start = DateTime::now().to_time.to_i
Find out when the end of your interval should be (say 1 week later):
finish = (DateTime::now()+1.week).to_time.to_i
Ruby uses this algorithm to generate random numbers. It is almost uniform. Then generate random numbers between the two:
r = Random.new.rand(start..finish)
Then convert that back to a date:
d = Time.at(r)
This looks promising as well:
http://rb-gsl.rubyforge.org/files/rdoc/randist_rdoc.html
And this too:
http://rb-gsl.rubyforge.org/files/rdoc/rng_rdoc.html
From wiki:
There are a couple of methods to
generate a random number based on a
probability density function. These
methods involve transforming a uniform
random number in some way. Because of
this, these methods work equally well
in generating both pseudo-random and
true random numbers.
One method, called the inversion
method, involves integrating up to
an area greater than or equal to the
random number (which should be
generated between 0 and 1 for proper
distributions).
A second method, called the
acceptance-rejection method,
involves choosing an x and y value and
testing whether the function of x is
greater than the y value. If it is,
the x value is accepted. Otherwise,
the x value is rejected and the
algorithm tries again.
The first method is the one used in the accepted answer in your SO linked question: Generate Random Numbers with Probabilistic Distribution
Another option is the Distribution gem under SciRuby. You can generate normal numbers by:
require 'distribution'
rng = Distribution::Normal.rng
random_numbers = Array.new(100).map { rng.call }
There are RNGs for various other distributions as well.