Upto what extent we should fill the missing values for a feature in a dataset so that it doesnt become redundant ?
I have a dataset which has a max of 42000 observations. There are three features which have around 20000, 35000 and 7000 values missing. Should I still use them by filling these missing values or dump these three features?
How do we decide the threshold for keeping or dumping a feature given the number of missing values of that feature ?
Generally, you can interpolate missing values from nearest samples in dataset, i like this manual for pandas about missing values http://pandas.pydata.org/pandas-docs/stable/missing_data.html, it lists many possible techniques to interpolate missing values from known part of dataset.
But in your case, i think that it's better to just remove those 2 first features, because i doubt that there could be any good interpolation for missing values, when you have such big amount of them, almost more than half of all values.
But you may try to fix third feature with missing values.
Related
I want to build an application that would do something equivalent to running lsof (maybe changing it to output differently, because string processing may mean it is not real time enough) in a loop and then associate each line (entries) with what iteration it was present in, what I will be referring further as frames, as later on it will be better for understanding. My intention with it is that showing the times in which files are open by applications can reveal something about their structure, while not having big impact on their execution, which is often a problem. One problem I have is on processing the output, which would be a table relating "frames X entry", for that I am already anticipating that I will have wildly variable entry lengths. Which can fall in that problem of representing on geometry when you have very different scales, the smaller get infinitely small, while the bigger gets giant and fragmentation makes it even worse; so my question is if plotting libraries deal with this problem and how they do it
The easiest and most well-established technique for showing both small and large values in reasonable detail is a logarithmic scale. Instead of plotting raw values, plot their logarithms. This is notoriously problematic if you can have zero or even negative values, but as I understand your situations all your lengths would be strictly positive so this should work.
Another statistical solution you could apply is to plot ranks instead of raw values. Take all the observed values, and put them in a sorted list. When plotting any single data point, instead of plotting the value itself you look up that value in the list of values (possibly using binary search since it's a sorted list) then plot the index at which you found the value.
This is a monotonous transformation, so small values map to small indices and big values to big indices. On the other hand it completely discards the actual magnitude, only the relative comparisons matter.
If this is too radical, you could consider using it as an ingredient for something more tuneable. You could experiment with a linear combination, i.e. plot
a*x + b*log(x) + c*rank(x)
then tweak a, b and c till the result looks pleasing.
I am looking to find a way in Google Spreadsheets to show a random percentage between 93% and 96%. This would include, 93%, 94%, etc up to 96% obviously.
I want this in a cell. I currently have:
=RANDBETWEEN(0.93,0.96)
I have also "format as a percentage" for that cell.
What it's displaying is 100%. Why 100%? I don't understand. Can somebody help me understand what I am doing wrong?
I have tried =RANDBETWEEN code but it may be wrong.
=RANDBETWEEN(0.93,0.96)
RANDBETWEEN uses whole values as parameters. Divide by 100 to get the number you are looking for
=RANDBETWEEN(93, 96)/100
without formatting it internally it would be:
=TEXT(RANDBETWEEN(93, 96)/100, "#%")
RANDBETWEEN(x,y) only works with integer values, so it's automatically returning a 1 instead of the decimal value you're trying to use.
To work around this and still use the Percentage formatting in your cell, you can use
=RANDBETWEEN(93,96)/100
This should produce the result you're looking for
https://support.google.com/docs/answer/3093507?hl=en
"Values with decimal parts may be used for low and/or high; this will cause the least and greatest possible values to be the next integer greater than low and/or the next integer less than high, respectively."
RANDBETWEEN will only return integers. Therefore you need to do some math to get the precision you are looking for.
=RANDBETWEEN(93, 96)/100
What is best method to identify and replace outlier for ApplicantIncome,
CoapplicantIncome,LoanAmount,Loan_Amount_Term column in pandas python.
I tried IQR with seaborne boxplot, and tried to identified the outlet and fill with NAN record after that take mean of ApplicantIncome and filled with NAN records.
Try to take group of below combination column ex: gender, education, selfemployed, Property_Area
And having below column in my dataframe
Loan_ID LP001357
Gender Male
Married NaN
Dependents NaN
Education Graduate
Self_Employed No
ApplicantIncome 3816
CoapplicantIncome 754
LoanAmount 160
Loan_Amount_Term 360
Credit_History 1
Property_Area Urban
Loan_Status Y
Outliers
Just like missing values, your data might also contain values that diverge heavily from the big majority of your other data. These data points are called “outliers”. To find them, you can check the distribution of your single variables by means of a box plot or you can make a scatter plot of your data to identify data points that don’t lie in the “expected” area of the plot.
The causes for outliers in your data might vary, going from system errors to people interfering with the data through data entry or data processing, but it’s important to consider the effect that they can have on your analysis: they will change the result of statistical tests such as standard deviation, mean or median, they can potentially decrease the normality and impact the results of statistical models, such as regression or ANOVA.
To deal with outliers, you can either delete, transform, or impute them: the decision will again depend on the data context. That’s why it’s again important to understand your data and identify the cause for the outliers:
If the outlier value is due to data entry or data processing errors,
you might consider deleting the value.
You can transform the outliers by assigning weights to your
observations or use the natural log to reduce the variation that the
outlier values in your data set cause.
Just like the missing values, you can also use imputation methods to
replace the extreme values of your data with median, mean or mode
values.
You can use the functions that were described in the above section to deal with outliers in your data.
Following links will be useful for you:
Python data cleaning
Ways to detect and remove the outliers
I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For these sort of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the if first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger - it basically goes through each criteria, checks if a match is still possible, and if so it looks at more criteria - if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it checks each entry as quickly as possible) but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
this is basically what a query optimizer does in SQL land. There are fast, free, in memory databases for exactly this purpose. Checkout sqlite https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is like the last resort for a query optimizer.
As I've understood, you want to select entries by criteria like
A& not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < lower_x2 & ...
The easiest way is to have them sorted by all possible xi, where i=1,2.. in separate sets, and have separated 'words' for various combination of A,B,..
The search will works as follows:
Select a proper world by Boolean criteria combination
For each i, find the population of lower_xi..upper_xi range in corresponding set (this operation is O(log(N))
Select i where the population is the lowest
While iterating instances through lower_xi..upper_xi range, filter the results by checking other upper/lower bound criteria (for all xj where j!=i)
Note that this s a general solution. Of course if you know some relation between your bound(s), you may use a list sorted by respective combination(s) of item values.
I'm trying to figure out an algorithm...
Input is a bunch of objects that have multiple values (eg 3 values per object, colour/taste/age, though it could be more).
The algorithm would then distribute the objects into a pre-defined number of sets. Each set should end up with almost the same number of objects (preferably the object count per set shouldn't differ more than 1), and achieve the objective of as fair a distribution of values per set as possible (eg try to have close to as many red in each set, and same for other colours, as well as tastes and ages, etc).
Values are tied to objects and cannot be changed. If you move an object from one set to another it brings all its values.
I found this related question: Algorithm for fair distribution of numbers into two sets
and the "number partitioning problem" suggested seems to help with single value distributions, but I'm looking for information/algorithms with multiple values per object (as described above).
Also note that the values cannot be normalized, ie each object cannot be totalled up into a single value.
Thank you kindly for any assistance.
IMHO, you should approach this as a clustering problem http://en.wikipedia.org/wiki/Cluster_analysis .