Suitable machine learning algorithm for column selection

I am new to machine learning. In my work I need an algorithm that selects some columns out of many in a 2D matrix, depending on the spread of the data. Below is a sample of the 2D matrix:
400 700 4 1400
410 710 4 1500
416 716 4 1811
..............
410 710 4 1300
Previously I used the standard deviation as the measure of spread and selected columns by comparing each column's standard deviation against threshold values. Observe that the 3rd column is constant and the last column varies tremendously. The 1st and 2nd columns also vary, but the spread of their data is small. Applying the standard deviation to each column gives sigma = 10, 10, 0, 200 respectively.
I chose some experimental threshold values to discard columns: if a column's sigma falls outside the threshold range, that column is discarded. I worked out those thresholds manually. The method is very simple, but maintaining the thresholds is tedious because there are many columns.
For this reason I would like to use a standard machine learning algorithm, or at least make these thresholds adaptive, so that I don't have to hard-code them. Can anyone suggest an appropriate algorithm for this?
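One simple way to make the threshold adaptive, before reaching for a full machine-learning method, is to derive it from the data itself, for example as a percentile of the per-column standard deviations. Below is a minimal NumPy sketch; the 50th-percentile cutoff and the direction of the comparison (keeping the high-spread columns) are illustrative assumptions, and the comparison can be flipped if you want the stable columns instead. scikit-learn's VarianceThreshold offers a similar filter with a fixed threshold.

import numpy as np

def select_columns_by_spread(X, percentile=50):
    """Keep the columns whose standard deviation exceeds a data-driven cutoff.
    Instead of hard-coding a threshold, the cutoff is a percentile of the
    per-column standard deviations, so it adapts to whatever data comes in."""
    X = np.asarray(X, dtype=float)
    sigmas = X.std(axis=0)                      # spread of each column
    cutoff = np.percentile(sigmas, percentile)  # adaptive threshold
    keep = sigmas > cutoff
    return X[:, keep], keep

data = np.array([
    [400, 700, 4, 1400],
    [410, 710, 4, 1500],
    [416, 716, 4, 1811],
    [410, 710, 4, 1300],
])
selected, mask = select_columns_by_spread(data)
print(mask)   # True for columns whose spread exceeds the cutoff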

Related

Best method to identify and replace outliers for the Salary column in Python

What is the best method to identify and replace outliers in the ApplicantIncome, CoapplicantIncome, LoanAmount and Loan_Amount_Term columns in pandas (Python)?
I tried the IQR method with a seaborn boxplot: I identified the outliers, replaced them with NaN, and then filled the NaN records with the mean of ApplicantIncome.
I would also like to do this per group, using a combination of columns such as Gender, Education, Self_Employed and Property_Area.
My dataframe has the following columns (one sample row shown):
Loan_ID LP001357
Gender Male
Married NaN
Dependents NaN
Education Graduate
Self_Employed No
ApplicantIncome 3816
CoapplicantIncome 754
LoanAmount 160
Loan_Amount_Term 360
Credit_History 1
Property_Area Urban
Loan_Status Y
Outliers
Just like missing values, your data might also contain values that diverge heavily from the vast majority of your other data. These data points are called "outliers". To find them, you can check the distribution of your individual variables with a box plot, or make a scatter plot of your data to identify points that don't lie in the "expected" area of the plot.
The causes of outliers vary, from system errors to people interfering with the data during entry or processing, but it's important to consider the effect they can have on your analysis: they skew summary statistics such as the mean, median and standard deviation, they can reduce normality, and they can distort the results of statistical models such as regression or ANOVA.
To deal with outliers, you can delete, transform, or impute them; the decision depends on the data context, which is why it's important to understand your data and identify the cause of the outliers:
If the outlier value is due to data entry or data processing errors, you might consider deleting the value.
You can transform the outliers by assigning weights to your observations or use the natural log to reduce the variation that the outlier values in your data set cause.
Just like the missing values, you can also use imputation methods to replace the extreme values of your data with median, mean or mode values.
You can use the functions that were described in the above section to deal with outliers in your data.
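As a concrete illustration of the IQR-plus-imputation approach described above, here is a minimal pandas sketch; the file name loans.csv and the factor k=1.5 are assumptions for the example, not something from the question.

import pandas as pd
import numpy as np

def impute_outliers_iqr(df, column, k=1.5):
    """Set values outside [Q1 - k*IQR, Q3 + k*IQR] to NaN, then fill the
    NaN entries with the column median (mean or mode work the same way)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    out = df.copy()
    is_outlier = ~out[column].between(lower, upper)
    out.loc[is_outlier, column] = np.nan
    out[column] = out[column].fillna(out[column].median())
    return out

# Hypothetical usage with the columns from the question:
# df = pd.read_csv("loans.csv")
# for col in ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]:
#     df = impute_outliers_iqr(df, col)

For the per-group variant, the same bounds can be computed inside a groupby on Gender, Education, Self_Employed and Property_Area before flagging the outliers.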
The following links will be useful:
Python data cleaning
Ways to detect and remove the outliers

Suitable formula/algorithm for detecting temperature fluctuations

I'm creating an app to monitor water quality. The temperature data is updated every 2 minutes to a Firebase Realtime Database. The app has two requirements:
1) It should alert the user when the temperature exceeds 33 degrees or drops below 23 degrees. This part is done.
2) It should alert the user when there is a big temperature fluctuation, analysing the data every 30 minutes. This is the part I'm confused about.
I don't know what algorithm to use to detect a big temperature fluctuation over a period of time and alert the user. Can someone help me with this?
For a period of 30 minutes, your app would give you 15 values.
If you want to figure out a big change in this data, then there is one way to do so.
You can implement the following method:
Calculate the mean and the standard deviation of the values.
For each value, subtract the mean and take the absolute value of the result.
Check whether the absolute value is greater than one standard deviation; if it is, you have a big fluctuation.
See this example for better understanding:
Let's suppose you have these values over 10 minutes:
25, 27, 24, 35, 28
First Step:
Mean = 27.8 (use 27 as an approximation)
One standard deviation = 3.9 (approx.)
Second Step: Absolute(Data - Mean)
abs(25-27) = 2
abs(27-27) = 0
abs(24-27) = 3
abs(35-27) = 8
abs(28-27) = 1
Third Step
Check whether any of the differences is greater than the standard deviation.
abs(35-27) gives 8, which is greater than 3.9.
So there is a big fluctuation. If all the differences were smaller than the standard deviation, there would be no big fluctuation.
You can refine the result by using two or three standard deviations instead of one.
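A minimal Python sketch of this method, using the example window above; the n_sigmas parameter is just the "one, two or three standard deviations" knob:

import statistics

def has_big_fluctuation(readings, n_sigmas=1):
    """Return True if any reading deviates from the window mean by more
    than n_sigmas standard deviations."""
    mean = statistics.mean(readings)
    sigma = statistics.pstdev(readings)   # population std dev of the window
    if sigma == 0:
        return False                      # all readings identical
    return any(abs(x - mean) > n_sigmas * sigma for x in readings)

window = [25, 27, 24, 35, 28]             # the example values from above
print(has_big_fluctuation(window))        # True: 35 is more than one sigma away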
Start by defining what you mean by fluctuation.
You don't say what temperature scale you're using. Fahrenheit, Celsius, Rankine, or Kelvin?
Your sampling rate is a new data value every two minutes. Do you define fluctuation as the absolute value of the difference between the last value and the current value? That's defensible.
If the maximum allowable absolute difference is defined relative to your 33 - 23 = 10 degree range, you're in business.
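A tiny sketch of that per-sample check; the threshold of 5 degrees is a made-up value, shown only as an example of something defined relative to the 10-degree alarm range:

def big_jump(previous, current, max_delta=5.0):
    """Flag a fluctuation when two consecutive 2-minute readings differ by
    more than max_delta degrees (max_delta is a made-up example threshold)."""
    return abs(current - previous) > max_delta

print(big_jump(26.0, 32.5))   # True: a 6.5 degree jump between samples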

Quantize/arrange/sequence numbers into a specific format

I'm trying to arrange multiple chunks of PCM audio data into a specific sequence.
The fact that it's audio data is just for context; the problem itself has nothing to do with audio/DSP.
My input is a varying set of files with varying lengths, and I'm trying to arrange the data sequentially into a new file, adding padding after each segment where needed, so that each input element is aligned to a grid that divides the whole file into 120 equal units. In other words, I need to be able to address the beginning of each segment by choosing an offset between 0 and 119.
To illustrate the problem, here is a trivial example. Two input files have the following byte lengths:
200
+ 400
---
= 600
In this case, no padding is needed.
The files can be arranged back to back, as they fit into the 120-grid as is: in the grid, the 200-byte file spans units 0-40 (40 units) and the 400-byte file spans units 40-120 (80 units).
This becomes trickier if any of the files do not fit the grid.
199
+ 398
---
= 597
Intuitively, it's easy to see that the 199-byte file needs 1 byte of padding at the end so that its length becomes 200, and the 398-byte file needs 2 bytes to become 400 bytes. We then have a nice 1:2 ratio between the two files, which in the 120-grid translates to 40 and 80 units.
Now I'm trying to find an algorithm that can do this for any number of input files from 1 to 120, where each file can have an arbitrary non-zero length.
Maybe there is an existing algorithm that does just that, but I'm finding it hard to come up with descriptive keywords for the problem.
I've tried to solve this naively, but somehow I fail to grok the problem fully. Basically I need to grow the individual files so that their sizes are all multiples of a common grid unit (the padded total divided by 120), which to me is kind of a chicken-and-egg problem: if I grow the files so their ratios fit together, I also grow the sum of their lengths, and I don't understand how to check both against the 120-grid...
Edit: OK, I think I got it:
https://gist.github.com/jpenca/b033122fcb2300c5e9e4
I'm not sure how to prove correctness, but trying this with varying inputs seems to work OK.
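For reference, here is one possible sketch of such an algorithm in Python (not necessarily the approach taken in the gist): start from the smallest candidate unit size, ceil(total / 120), and grow it until every padded segment fits into at most 120 grid units in total.

import math

def pad_to_grid(lengths, units=120):
    """Pad each segment to a whole number of grid units so the padded total
    spans exactly `units` units (i.e. every segment starts on a unit boundary)."""
    if not 1 <= len(lengths) <= units:
        raise ValueError("need between 1 and %d segments" % units)
    unit = max(1, math.ceil(sum(lengths) / units))   # smallest candidate unit size
    while True:
        per_segment = [math.ceil(n / unit) for n in lengths]
        if sum(per_segment) <= units:
            break
        unit += 1                                    # grid too fine, coarsen it
    padded = [u * unit for u in per_segment]
    tail = units * unit - sum(padded)                # padding after the last segment
    return padded, unit, tail

print(pad_to_grid([200, 400]))   # ([200, 400], 5, 0) -> 40 + 80 units
print(pad_to_grid([199, 398]))   # ([200, 400], 5, 0) -> same grid after padding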

How to understand server performance using standard deviation in the JMeter Summary Report?

How does JMeter calculate the TOTAL value of Std. Dev.? What is it based on?
Even though the summary row is named TOTAL, its values aren't the sums of the values above it. Rather, each value is calculated over the entire data set.
If there are 100 samples in row1 and 100 samples in row2, the total number of samples is 200. "Total" average is the average of all 200 samples and "Total" Std Dev. is the standard deviation of all 200 samples.
OVERALL would probably be a better word to use in place of TOTAL for the summary row.
As for what standard deviation means, this isn't a JMeter measure per se, but a statistical measure of the data set. It represents how much variation or spread is present in the data set.
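To make the pooling concrete, here is a small NumPy sketch with made-up response times; it only shows that the TOTAL standard deviation is computed over all pooled samples rather than combined from the per-row figures.

import numpy as np

rng = np.random.default_rng(0)
row1 = rng.normal(200, 20, 100)   # 100 made-up response times for sampler 1
row2 = rng.normal(500, 50, 100)   # 100 made-up response times for sampler 2

pooled = np.concatenate([row1, row2])
print(row1.std(), row2.std())     # per-row standard deviations
print(pooled.std())               # the "TOTAL" figure: std dev of all 200 samples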

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
I am not asking how to use random number generators; this is a statistical question. I know how to generate random numbers, but my question is how to ensure the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, because I do not want the control rows to be identifiable simply by their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they end up close to each other if you do it truly randomly :)
But what I would do is:
You have N rows of real data and x rows of control data.
To get the index of the row at which to insert the i-th control row, I'd use: N/(x+1) * i + r, where r is a random number, different for each control row and small compared to N/x. Choose any way of determining r; it can have a gaussian or even a flat distribution. Here i is the index of the control row, so 1 <= i <= x.
This way you avoid the control rows condensing in one place, and you can also be sure that they won't be at regular distances from each other.
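A small Python sketch of that formula; the jitter fraction of 25% of the step is an arbitrary illustrative choice for "small compared to N/x".

import random

def control_row_positions(n_real, n_control, jitter_fraction=0.25):
    """Spread n_control insertion indices evenly over n_real rows, then add a
    small random offset to each so the positions aren't perfectly regular."""
    step = n_real / (n_control + 1)          # N / (x + 1)
    jitter = step * jitter_fraction          # r stays small compared to the step
    positions = []
    for i in range(1, n_control + 1):
        r = random.uniform(-jitter, jitter)
        pos = int(step * i + r)
        positions.append(min(max(pos, 0), n_real - 1))
    return sorted(positions)

print(control_row_positions(3000, 20))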
Here's my thought. Why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert random data there.
for (int i = 0; i < numberOfExistingRows; i++)
{
    double r = random();   // uniform value in [0, 1)
    if (r > 0.5)           // roughly a 50% chance per row
    {
        InsertRandomData();
    }
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with examples than with English):
If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows, you'd insert one at every 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
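A direct Python transcription of steps a) to d), crudeness included; the only assumption is that a random offset of up to one full stride is acceptable:

import random

def crude_insertion_indices(n_real=3000, n_control=20):
    """Walk the insertion index forward in fixed strides and back each
    insertion off by a random amount (steps a-d above)."""
    stride = n_real // n_control                  # 150 in the example
    indices = []
    insertion_index = stride
    for _ in range(n_control):
        offset = random.randint(0, stride)        # step a)
        indices.append(insertion_index - offset)  # step b)
        insertion_index += stride                 # step c), then repeat from a)
    return indices

print(crude_insertion_indices())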
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a gaussian deviate with the mean set to the real data size divided by the control data size (the former can be estimated if necessary, rather than measured or assumed known). Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate: a smaller standard deviation means a narrower distribution and tighter adherence to uniform spacing, while a larger standard deviation means a wider distribution and looser adherence to uniform spacing.
Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well: simply keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data, and if the index just before this was off the end, insert the control data at the end.
You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject whose title is "Simulation".
Also: in some circumstances non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outlined above still applies, using half the mean of whatever distribution seems right.
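A rough Python sketch of the inter-arrival idea, including the half-mean trick for the start of the file; the spread factor of 0.3 is an arbitrary illustrative value, and an exponential/Poisson gap could be substituted as noted above.

import random

def interleave_with_control(real_rows, control_rows, spread=0.3):
    """Insert control rows into real rows using gaussian inter-arrival gaps.
    The mean gap is len(real_rows) / len(control_rows); `spread` scales the
    standard deviation as a fraction of that mean (an arbitrary choice)."""
    mean_gap = len(real_rows) / max(len(control_rows), 1)
    sigma = mean_gap * spread
    out, idx = [], 0
    # start half a mean gap "before" the data, as described above
    next_insert = random.gauss(mean_gap / 2, sigma)
    for control in control_rows:
        next_insert = max(next_insert, idx)            # never move backwards
        cut = min(int(round(next_insert)), len(real_rows))
        out.extend(real_rows[idx:cut])                 # copy real rows up to the gap
        out.append(control)                            # then insert the control row
        idx = cut
        next_insert += random.gauss(mean_gap, sigma)   # next inter-arrival gap
    out.extend(real_rows[idx:])                        # remaining real rows
    return out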
