Outlier Treatment using Python - validation

I am a novice in Data Science and in the problem which I am trying to solve, I am stuck up with the outlier detection and treatment. Some of the insights about the dataset below:
It's a regression problem
Having both numerical and categorical features
Numerical features include both discrete and continuous data columns
Categorical features include mostly nominal & Ordinal data columns
I've done the missing value imputation and categorical data transformation
I am stuck up since I don't know the way of outlier detection and treatment of numerical data. I request any of your valuable help in proceeding further.
Please let me know if you want any snapshot of the numerical data in order to give a solution.
I haven't added it since it's a generic doubt as I don't even know how and what to use for outlier detection and treatment.

Plot the distribution of the numerical data
Do you see a normal distribution or a skewed distribution?
If it is normal. you can fairly take the median and 3 * median
Any value > 3* median is considered an outlier.

Related

Machine learning algorithm for correlation between indicators

I have a dataset with several indicators related to some geographical entities ,I want to study factors that influence an indicator A (among the other indicator) .I need to determine which indicators affect it the most (correlation)
which ML algo should I use
I want to have a kind of scoring function for my indicator A to allow its prediction
enter image description here
What you are looking for are correlation coefficients, you have multiple choices for that, the most commons are:
Pearson's coefficient which only measure linear relationship between two variables, see [Scipy's implementation]
Spearman's coefficient which can show non-linear relationship , see Scipy's implementation
You can also normalize your data using z-normalization and then do a simple Linear regression. The regression coefficient can give you an idea of the influence of each variable on the outcome. However this method is highly sensible to multi-collinearity which might be present, especially if your variables are geographical.
Could you provide an example of the dataset? Discrete or continuous variables? Which software are you using?
Anyway an easy way to test correlation (without going into ML algorithms in the very sense) is to simply perform Pearson's or Spearman's correlation coefficient on selected features or on the whole dataset by creating a matrix of the data. You can do that in Python with NumPy (see this) or in R (see this).
You can also use simple linear regression or logistic/multinomial logistic regression (depending on the nature of your data) to quantify the influence of the other features on your target variables. Just keep in mind that "correlation is not causation. Look here to see some models.
Then it depends on the object of your analysis whether to aggregate all the features of all the geographical points or create covariance matrices for each "subset" of observation related to the geographical points.

price predicting using regreddion

I have dataset of gold prices and after modifying and some preprocessing i ended up with dataframe below:
There is 50,000 record in dataset and there are morethan 500 different markets with different frequencies, all columns expect date are int type and date is datetime object. i need to predict price per unit in some specific dates. but somehow i baffled with so many methods.
My question is what regression algorithm/method is results good prediction for this kind of data ?
In machine learning or data mining as they always say, a lot of things can be done in a lot of ways. Lets try to use elimination to decide on the algorithm for the given problem.The primary case is that the class variable (feature to be predicted) is continuous hence you should use any regression algorithms. I would suggest to go with linear regression, check the accuracy using r^2 score which is basically a squared difference between an actual and a predicted value. If it is not on par, try randomforest regressor.

What estimator to use in scikit-learn?

This is my first brush with machine learning, so I'm trying to figure out how this all works. I have a dataset where I've compiled all the statistics of each player to play with my high school baseball team. I also have a list of all the players that have ever made it to the MLB from my high school. What I'd like to do is split the data into a training set and a test set, and then feed it to some algorithm in the scikit-learn package and predict the probability of making the MLB.
So I looked through a number of sources and found this cheat sheet that suggests I start with linear SVC.
So, then as I understand it I need to break my data into training samples where each row is a player and each column is a piece of data about the player (batting average, on base percentage, yada, yada), X_train; and a corresponding truth matrix of a single row per player that is simply 1 (played in MLB) or 0 (did not play in MLB), Y_train. From there, I just do Fit(X,Y) and then I can use predict(X_test) to see if it gets the right values for Y_test.
Does this seem a logical choice of algorithm, method, and application?
EDIT to provide more information:
The data is made of 20 features such as number of games played, number of hits, number of Home Runs, number of Strike Outs, etc. Most are basic counting statistics about the players career; a few are rates such as batting average.
I have about 10k total rows to work with, so I can split the data based on that; but I have no idea how to optimally split the data, given that <1% have made the MLB.
Alright, here are a few steps that might want to make:
Prepare your data set. In practice, you might want to scale the features, but we'll leave it out to make the first working model as simple as possible. So will just need to split the dataset into test/train set. You could shuffle the records manually and take the first X% of the examples as the train set, but there's already a function for it in scikit-learn library: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. You might want to make sure that both: positive and negative examples are present in the train and test set. To do so, you can separate them before the test/train split to make sure that, say 70% of negative examples and 70% of positive examples go the training set.
Let's pick a simple classifier. I'll use logistic regression here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, but other classifiers have a similar API.
Creating the classifier and training it is easy:
clf = LogisticRegression()
clf.fit(X_train, y_train)
Now it's time to make our first predictions:
y_pred = clf.predict(X_test)
A very important part of the model is its evaluation. Using accuracy is not a good idea here: the number of positive examples is very small, so the model that unconditionally returns 0 can get a very high score. We can use the f1 score instead: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
If you want to predict probabilities instead of labels, you can just use the predict_proba method of the classifier.
That's it. We have a working model! Of course, there are a lot thing you may try to improve, such as scaling the features, trying different classifiers, tuning their hyperparameters, but this should be enough to get started.
If you don't have a lot of experience in ML, in scikit learn you have classification algorithms (if the target of your dataset is a boolean or a categorical variable) or regression algorithms (if the target is a continuous variable).
If you have a classification problem, and your variables are in a very different scale a good starting point is a decision tree:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
The classifier is a Tree and you can see the decisions that are taking in the nodes.
After that you can use random forest, that is a group of decision trees that average results:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
After that you can put the same scale in every feature:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
And you can use other algorithms like SVMs.
For every algorithm you need a technique to select its parameters, for example cross validation:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
But a good course is the best option to learn. In coursera you can find several good courses like this:
https://www.coursera.org/learn/machine-learning

Predicting Data Set

In mathematica I am have 10 data sets. I am trying to figure out how to predict future outcomes based on this data. The data almost follows a normal distribution. I want to predict what the average curve looks like based on the data I have. Is there any way to do this?
See the documentation for
https://reference.wolfram.com/language/howto/GetResultsForFittedModels.html
or for a time series
https://reference.wolfram.com/language/ref/TimeSeriesModelFit.html

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass (well, m/z but that not so important) values and a time. I extract the data from the file, however, it is possible to get repeat measurements and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be group together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms sadly.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering and also look into DBSCAN if you really don't want to specify how many clusters in advance. You will need to define a distance metric and in that step is where you would determine which of the features or combination of features you will be clustering on.
Why don't you just set a threshold?
If successive values (by time) do not differ by at least +-0.1 (by m/s) they a grouped together. Alternatively, use a relative threshold: differ by less than +- .1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For the simple one dimension K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting appropriate K. The best way to select a good K is to either plot K vs residual variance and select the K that "dramatically" reduces variance. Another strategy is to use some information criteria (eg. Bayesian Information Criteria).
You can extend K-Means to multi-dimensional data easily. But you should be beware of scaling the individual dimensions. Eg. Among items (1KG, 1KM) (2KG, 2KM) the nearest point to (1.7KG, 1.4KM) is (2KG, 2KM) with these scales. But once you start expression second item in meters, probably the alternative is true.

Resources