Best method to identify and replace outliers for a salary column in Python - sklearn-pandas

What is the best method to identify and replace outliers for the ApplicantIncome,
CoapplicantIncome, LoanAmount, and Loan_Amount_Term columns in pandas?
I tried the IQR rule with a seaborn boxplot: I identified the outliers, replaced them with NaN, and then filled those NaN records with the mean of ApplicantIncome.
I would like to do this grouped by a combination of the columns below, e.g. Gender, Education, Self_Employed, Property_Area.
My dataframe has the following columns (one sample record shown):
Loan_ID LP001357
Gender Male
Married NaN
Dependents NaN
Education Graduate
Self_Employed No
ApplicantIncome 3816
CoapplicantIncome 754
LoanAmount 160
Loan_Amount_Term 360
Credit_History 1
Property_Area Urban
Loan_Status Y

Outliers
Just like missing values, your data might also contain values that diverge heavily from the big majority of your other data. These data points are called “outliers”. To find them, you can check the distribution of your single variables by means of a box plot or you can make a scatter plot of your data to identify data points that don’t lie in the “expected” area of the plot.
The causes for outliers in your data might vary, from system errors to people interfering with the data during data entry or data processing, but it's important to consider the effect that they can have on your analysis: they will change the results of summary statistics such as the standard deviation, mean or median, and they can decrease normality and distort the results of statistical models such as regression or ANOVA.
To deal with outliers, you can either delete, transform, or impute them: the decision will again depend on the data context. That’s why it’s again important to understand your data and identify the cause for the outliers:
If the outlier value is due to data entry or data processing errors, you might consider deleting the value.
You can transform the outliers by assigning weights to your observations or use the natural log to reduce the variation that the outlier values in your data set cause.
Just like the missing values, you can also use imputation methods to replace the extreme values of your data with median, mean or mode values.
You can use the same functions you would use for missing values to deal with the outliers in your data.
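As a minimal sketch of this (a toy DataFrame stands in for the real loan data; the helper name is hypothetical), you can mark IQR outliers as NaN and then impute them with a group mean over the column combination from the question:

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan dataset described above.
df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'Education': ['Graduate'] * 6,
    'Self_Employed': ['No', 'No', 'Yes', 'No', 'Yes', 'No'],
    'Property_Area': ['Urban', 'Urban', 'Rural', 'Urban', 'Rural', 'Urban'],
    'ApplicantIncome': [3816, 4200, 2900, 81000, 3100, 3950],  # 81000 is an outlier
})

def replace_outliers_with_group_mean(df, column, group_cols):
    # Mark values outside 1.5 * IQR as NaN (the usual boxplot rule).
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[column] = df[column].where(df[column].between(lower, upper))

    # Fill the NaNs with the mean of the row's group, falling back to the overall mean.
    group_mean = df.groupby(group_cols)[column].transform('mean')
    df[column] = df[column].fillna(group_mean).fillna(df[column].mean())
    return df

df = replace_outliers_with_group_mean(
    df, 'ApplicantIncome',
    ['Gender', 'Education', 'Self_Employed', 'Property_Area'])
print(df)

The same call can be repeated for CoapplicantIncome, LoanAmount and Loan_Amount_Term.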
The following links will be useful for you:
Python data cleaning
Ways to detect and remove the outliers

Related

How does "Addressing missing data" help KNN function better?

Source: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
This page has a section with the following passage:
Best Prepare Data for KNN
Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed.
Lower Dimensionality: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.
Can someone please explain the second point, i.e. Address Missing Data, in detail?
Missing data in this context means that some samples do not have all the existing features.
For example:
Suppose you have a database with age and height for a group of individuals.
This would mean that for some persons either the height or the age is missing.
Now, why this affects KNN?
Given a test sample, KNN finds the samples that are closest to it (i.e., the individuals with similar age and height).
KNN does this to make some inference about the test sample based on its nearest neighbors.
If you want to find these neighbors you must be able to compute the distance between samples. To compute the distance between 2 samples you must have all the features for these 2 samples.
If some of them are missing you won't be able to compute the distance.
So implicitly you would be losing the samples with missing data.
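As a small illustration of what "address missing data" can mean in practice (a sketch with hypothetical age/height data, using scikit-learn's SimpleImputer and KNeighborsClassifier), the missing values are imputed first so that distances can always be computed:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical data: [age, height], with some values missing (np.nan).
X = np.array([
    [25.0, 170.0],
    [30.0, np.nan],   # height missing
    [np.nan, 160.0],  # age missing
    [40.0, 180.0],
])
y = np.array([0, 0, 1, 1])

# Impute missing features with the column mean, then run KNN on complete data.
model = make_pipeline(SimpleImputer(strategy='mean'),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[28.0, np.nan]]))  # the missing height is imputed before distances are computed

The other option mentioned in the quote, excluding the incomplete samples, would simply drop the rows containing np.nan before fitting.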

Dealing with Missing Values in dataset

Up to what extent should we fill in the missing values for a feature in a dataset before the feature becomes redundant?
I have a dataset with a maximum of 42,000 observations. There are three features which have around 20,000, 35,000 and 7,000 values missing. Should I still use them by filling in these missing values, or should I drop these three features?
How do we decide the threshold for keeping or dropping a feature, given the number of missing values in that feature?
Generally, you can interpolate missing values from the nearest samples in the dataset. I like the pandas manual on missing values, http://pandas.pydata.org/pandas-docs/stable/missing_data.html, which lists many possible techniques for interpolating missing values from the known part of the dataset.
But in your case, I think it is better to just remove the first two features, because I doubt there is any good interpolation for the missing values when such a large share of them, around half or more of all values, is missing.
You may still try to fix the third feature, which has far fewer missing values.
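As a hedged sketch (assuming a pandas DataFrame df; the 50% threshold is an arbitrary illustrative choice, not a rule), you can compute the missing fraction per column, drop the columns above the threshold, and interpolate the rest:

import numpy as np
import pandas as pd

# Hypothetical frame with missing values.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],     # 50% missing
    'b': [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    'c': [1.0, 2.0, np.nan, 4.0],        # 25% missing
})

missing_fraction = df.isna().mean()
print(missing_fraction)

# Drop columns with more than 50% missing, fill the rest by interpolation.
keep = missing_fraction[missing_fraction <= 0.5].index
df = df[keep].interpolate(limit_direction='both')
print(df)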

NULL values across a dimension in Support Vector Machine

I am designing a support vector machine with n dimensions. Along every dimension, the values range from 0 to 1. Now, if I am unable to determine the value along a particular dimension for a particular data point from the original data set, for various reasons, what should the value along that dimension be for the SVM? Can I just set it to -1 to indicate a missing value?
You would be better served leaving the missing value out altogether if the dimension won't be able to contribute to your machine's partitioning of the space. This is because the only thing the SVM can do is place zero weight on that dimension as far as classification power goes, since all of the points in that dimension are at the same place.
Thus each pass over that dimension is just wasted computational resources. If recovering this value is important, you may be able to use a regression model of some type to get estimated values back, but if that estimated value is generated from your other data, then again it won't actually contribute to your SVM, because the data in that estimated dimension is nothing more than a summary of the data you used to generate it (which I assume is in your SVM model already).
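As a hedged sketch of that regression-based recovery (hypothetical data in [0, 1]; scikit-learn's IterativeImputer is a regression-based imputer, and NaN is used rather than a -1 sentinel), the missing coordinates are estimated from the other features before fitting the SVM, with the caveat above that the imputed column adds little new information:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical points in [0, 1]^3 with some missing coordinates (np.nan, not -1).
X = np.array([
    [0.1, 0.9, 0.2],
    [0.2, np.nan, 0.3],
    [0.8, 0.1, 0.9],
    [0.9, 0.2, np.nan],
])
y = np.array([0, 0, 1, 1])

# Regression-based imputation of the missing coordinates, then a standard SVM.
model = make_pipeline(IterativeImputer(random_state=0), SVC(kernel='rbf'))
model.fit(X, y)
print(model.predict([[0.85, 0.15, np.nan]]))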

What is the relative performance of 1 geometry column vs 4 decimals in Sql Server 2008?

I need to represent the dimensions of a rectangular surface in a SQL Server 2008 database. I will need to perform queries based on the distance between different points and on the total area of the surface.
Will my performance be better using a geometry datatype or 4 decimal columns? Why?
If the geometry datatype is unnecessary in this situation, what amount of complexity in the geometrical shape would be required for using the geometry datatype to make sense?
I have not used the geometry datatype, and have never had reason to read up on it. Even so, it seems to me that if you’re just doing basic arithmetic on a simple geometric object, the mundane old SQL datatypes should be quite efficient, particularly if you toss in some computed columns for frequently used calculations.
For example:
--DROP TABLE MyTable
CREATE TABLE MyTable
(
     X1 decimal not null  -- note: bare decimal defaults to decimal(18,0); add precision/scale for fractional coordinates
    ,Y1 decimal not null
    ,X2 decimal not null
    ,Y2 decimal not null
    -- Computed columns: evaluated only when referenced (or persisted if indexed)
    ,Area     as abs((X2 - X1) * (Y2 - Y1))
    ,XLength  as abs(X2 - X1)
    ,YLength  as abs(Y2 - Y1)
    ,Diagonal as sqrt(power(X2 - X1, 2) + power(Y2 - Y1, 2))
)

INSERT MyTable (X1, Y1, X2, Y2) values (1, 1, 4, 5)
INSERT MyTable (X1, Y1, X2, Y2) values (4, 5, 1, 1)
INSERT MyTable (X1, Y1, X2, Y2) values (0, 0, 3, 3)

SELECT * FROM MyTable
Ugly calculations, but they won’t be performed unless and until they are actually referenced (or unless you choose to index them). I have no statistics, but performing the same operations via the geometry datatype probably means accessing rarely used mathematical subroutines, possibly embedded in system CLR assemblies, and I just can’t see that being significantly faster than the bare-bones SQL arithmetic routines.
I just took a look in BOL at the geometry datatype. (a) Zounds! (b) Cool! Check out the entries under “geometry Data Type Method Reference” (online here, but you want to look at the expanded treeview under this entry). If that’s the kind of functionality you’ll be needing, by all means use the geometry data type, but for simple processing, I’d stick with the knucklescraper datatypes.
The geometry data types are more complex than simple decimals, so there is bound to be some overhead. But they do provide functions that calculate the distance between two points, and I would assume these have been optimised. The question might be: if you implemented the distance-between-points logic yourself, would this take longer than having the data in the appropriate format in the first place?
As with every DB question, it may come down to the ratio of inserts vs. selects/calculations.
The geometry datatype is spatial and decimal isn't.
Spatial vs. Non-spatial Data
Spatial data includes location, shape, size, and orientation.
For example, consider a particular square:
its center (the intersection of its diagonals) specifies its location
its shape is a square
the length of one of its sides specifies its size
the angle its diagonals make with, say, the x-axis specifies its orientation.
Spatial data includes spatial relationships. For example, the arrangement of ten bowling pins is spatial data.
Non-spatial data (also called attribute or characteristic data) is that information which is independent of all geometric considerations.
For example, a person's height, mass, and age are non-spatial data because they are independent of the person's location.
It's interesting to note that, while mass is non-spatial data, weight is spatial data in the sense that something's weight is very much dependent on its location!
It is possible to ignore the distinction between spatial and non-spatial data. However, there are fundamental differences between them:
spatial data are generally multi-dimensional and autocorrelated.
non-spatial data are generally one-dimensional and independent.
These distinctions put spatial and non-spatial data into different philosophical camps with far-reaching implications for conceptual, processing, and storage issues.
For example, sorting is perhaps the most common and important non-spatial data processing function that is performed.
It is not obvious how to even sort locational data such that all points end up "nearby" their nearest neighbors.
These distinctions justify a separate consideration of spatial and non-spatial data models. This unit limits its attention to the latter unless otherwise specified.
Here's some more if you're interested:
http://www.ncgia.ucsb.edu/giscc/units/u045/u045_f.html
Here's a link I found about benchmarking spatial data warehouses: http://hpc.ac.upc.edu/Talks/dir08/T000327/paper.pdf

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now, I am not trying to ask how to use random number generators; I am asking a statistical question. I know how to generate random numbers, but my question is how do I ensure that the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, as I do not want the control rows to be identifiable simply based on their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they end up close to each other if you do it truly at random :)
But what I would do is:
You have N rows of real data and x rows of control data.
To get the index of the row after which you should insert the i-th control row, I'd use: N/(x+1) * i + r, where r is some random number, different for each of the control rows and small compared to N/x. Choose any way of determining r; it can have either a Gaussian or even a flat distribution. i is the index of the control row, so 1 <= i <= x.
This way you can be sure that you avoid condensation of your control rows in one single place. You can also be sure that they won't be at regular distances from each other.
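A minimal Python sketch of this scheme (the function name and the jitter size are illustrative choices, not part of the answer above):

import random

def control_row_indices(n_real, n_control, jitter_frac=0.25, seed=None):
    # Index (into the real data) after which to insert each control row:
    # evenly spaced anchors N/(x+1) * i plus a small random jitter r.
    rng = random.Random(seed)
    spacing = n_real / (n_control + 1)
    indices = []
    for i in range(1, n_control + 1):
        r = rng.uniform(-jitter_frac * spacing, jitter_frac * spacing)
        indices.append(int(round(spacing * i + r)))
    return sorted(indices)

print(control_row_indices(3000, 20, seed=42))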
Here's my thought: why don't you just loop through the existing rows and "flip a coin" for each row to decide whether to insert control data there?
// Assumes random() returns a floating-point value in [0, 1).
for (int i = 0; i < numberOfExistingRows; i++)
{
    double r = random();
    if (r > 0.5)   // coin flip: inserts control data after roughly half of the rows
    {
        InsertRandomData();
    }
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with examples than with English):
If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows you'd insert one at each 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
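A rough Python sketch of steps a) through d) (the helper name is chosen here purely for illustration):

import random

def interleave_control(real_rows, control_rows, seed=None):
    # Pick an insertion index every `spacing` rows, shift it back by a
    # random amount in [0, spacing), and insert the control row there.
    rng = random.Random(seed)
    spacing = len(real_rows) // len(control_rows)   # 3000 // 20 = 150
    positions = [(i + 1) * spacing - rng.randrange(spacing)
                 for i in range(len(control_rows))]
    out = list(real_rows)
    # Insert from the end so earlier insertions don't shift later positions.
    for pos, row in sorted(zip(positions, control_rows),
                           key=lambda pr: pr[0], reverse=True):
        out.insert(pos, row)
    return out

mixed = interleave_control(list(range(3000)),
                           ['control-%d' % i for i in range(20)], seed=1)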
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a Gaussian deviate with the mean set to the real data size divided by the control data size (the former of which could be estimated if necessary, rather than measured or assumed known). Set the standard deviation of this Gaussian based on how much "spread" you're willing to tolerate: a smaller stddev means tighter adherence to uniform spacing, a larger stddev means looser adherence.
Now what about the first and last sections of the file? That is, what about an insertion of control data at the very beginning or very end? One thing you can do is come up with special-case estimates for these, but a nicer trick is as follows: start your "index" into the real data at minus half the Gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well (simply keep generating deviates until you reach an "index" at least half the Gaussian mean beyond the end of the real data; if the index just before this was off the end, generate the control data at the end).
You want to look at more than just statistics: it's helpful, when developing an algorithm for this sort of thing, to look at rudimentary queueing theory. See Wikipedia or The Turing Omnibus, which has a nice, short chapter on the subject titled "Simulation".
Also, in some circumstances non-Gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outlined above still applies, using half the mean of whatever distribution seems right.
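A hedged Python sketch of the interarrival idea with Gaussian gaps (the spread parameter and the function name are illustrative; the Poisson variant would swap in a different deviate):

import random

def insert_by_interarrival(real_rows, control_rows, spread=0.3, seed=None):
    # Copy real rows, inserting a control row after a Gaussian-distributed gap
    # whose mean is len(real) / len(control). A smaller `spread` gives more
    # uniform spacing, a larger one looser spacing.
    rng = random.Random(seed)
    mean_gap = len(real_rows) / len(control_rows)
    out, controls = [], iter(control_rows)
    # Start the "index" at minus half the mean so an insertion can land near
    # the very beginning of the file, as suggested above.
    next_insert = -mean_gap / 2 + rng.gauss(mean_gap, spread * mean_gap)
    for i, row in enumerate(real_rows):
        while next_insert <= i:
            try:
                out.append(next(controls))
            except StopIteration:
                next_insert = float('inf')
                break
            next_insert += max(1.0, rng.gauss(mean_gap, spread * mean_gap))
        out.append(row)
    out.extend(controls)  # any leftover control rows go at the end
    return out

mixed = insert_by_interarrival(list(range(3000)),
                               ['control-%d' % i for i in range(20)], seed=7)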

Resources