How to remove outliers from all columns that are more than +/- 3 SD from the mean?

I have a dataframe with a length of 1168 and 270 columns.
The goal is the title: Remove outliers from all 270 columns that are more than +/- 3 standard deviations away from the mean.
My code is below. However, it keeps only 40 data points. This doesn't make sense to me, since the dataframe originally has 1168 rows, which means it is keeping only about 3% of the entire dataset.
import numpy as np
from scipy import stats
len(df[(np.abs(stats.zscore(df)) < 3).all(axis=1)])

I think I can tell you what's wrong, at least: .all(axis=1) reduces across the columns, giving one boolean per row that is True only if every column in that row is True. That means you have 40 rows in which all 270 values are within +/- 3 SD; any row with an outlier in even one column is dropped.
See docs:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.all.html
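I think this should work; it masks individual outlier values as NaN instead of dropping whole rows, and counts the remaining values per column:
df[np.abs(stats.zscore(df)) < 3].count()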
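To make the difference between the two filters concrete, here is a minimal sketch on a small made-up frame. The data, the helper name zscores and the variable names are assumptions for illustration only; the z-scores are computed with pandas directly (population standard deviation, ddof=0, which is also the default of scipy.stats.zscore).

```python
import pandas as pd

def zscores(df):
    # Population z-scores per column (ddof=0, like scipy.stats.zscore's default).
    return (df - df.mean()) / df.std(ddof=0)

# Hypothetical data: column "a" has one extreme value, column "b" has none.
demo = pd.DataFrame({
    "a": [0.0] * 19 + [100.0],
    "b": [float(i) for i in range(20)],
})

within = zscores(demo).abs() < 3  # boolean frame, same shape as demo

# Row-wise filter: a row survives only if every column is within 3 SD.
rows_kept = demo[within.all(axis=1)]

# Element-wise mask: only the offending cells become NaN; no rows are dropped.
masked = demo.where(within)
per_column_counts = masked.count()
```

With 270 columns the row-wise filter is very strict, because a single outlier anywhere in a row removes that row; the element-wise mask keeps the row and blanks only the outlying cell.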

Dynamic reduction of elements in vector

I have a vector which contains several different values, where all of them are between 0 and 1.
I also have two values, called min and max, that represent the minimum and maximum bounds; these two values may change over time.
I would like to dynamically reduce the vector so that it keeps only the values that lie within the interval defined by min and max.
For example,
at time t=1 I have that vector:
a=[0.5,0.2,0.6,0.3,0.2187,0.8798,0.5432,0.3563,0.3981,0.7845];
min=0.3;
max=0.7;
Given vector a and the two values (min and max), the new vector a_new
should be:
a_new=[0.5,0.6,0.3,0.5432,0.3563,0.3981];
This is because min and max define the bounds that select which elements of the original vector are kept.
Code solution
If you just want to generate a new vector given the old one, use the following syntax:
a_new = a(a>=min & a<=max);
If you also want the positions of the deleted and non-deleted values, use MATLAB's find function:
nonDeletedIndices = find(a>=min & a<=max);
deletedIndices = find(a<min | a>max);
Result
a_new =
0.5000 0.6000 0.3000 0.5432 0.3563 0.3981
nonDeletedIndices=
1 3 4 7 8 9
deletedIndices=
2 5 6 10
Suggestion
I suggest using variable names other than min and max, such as minVal and maxVal. MATLAB already has functions with these names, and you don't want to shadow them.
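For readers who want to check the result outside MATLAB, the same range filter can be sketched in Python. The function name filter_in_range and the names min_val and max_val are assumptions for illustration; the indices are reported 1-based so that they line up with the MATLAB output above.

```python
def filter_in_range(values, min_val, max_val):
    """Split values into kept elements plus 1-based kept/deleted positions."""
    kept_idx = []
    deleted_idx = []
    for position, value in enumerate(values, start=1):
        if min_val <= value <= max_val:
            kept_idx.append(position)
        else:
            deleted_idx.append(position)
    kept = [values[i - 1] for i in kept_idx]
    return kept, kept_idx, deleted_idx

a = [0.5, 0.2, 0.6, 0.3, 0.2187, 0.8798, 0.5432, 0.3563, 0.3981, 0.7845]
a_new, non_deleted_indices, deleted_indices = filter_in_range(a, 0.3, 0.7)
```

Because the bounds are ordinary arguments, the function can simply be called again whenever they change.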

How does ORACLE DB sum NUMBER(*,s) with many records?

I am wondering how Oracle sums NUMBER(9,2) with SUM(numWithScale/7).
This is because I am wondering how the error will propagate with a large number of records.
Let's say I have a table EMP_SAL with columns EMP_ID and numWithScale, numWithScale being a salary.
To keep it simple, let us make the numWithScale column NUMBER(9,2): 9 digits of precision with a scale of 2 decimal places. All of the numbers in the table are random values from 10.00 to 20.00 (e.g. 10.12, 20.00, 19.95).
I divide by 7 in my calculation to produce trailing digits that round up or down.
Now, I sum all of the employees' salaries with SUM(numWithScale/7).
Will the sum round each time it adds a record, or does Oracle round after the calculation is complete? That is, if each rounding can introduce an error of +/-0.01, then with many additions and roundings the error adds up. If it rounds only at the end, I don't have to worry about the error adding up (unless I use the result in many more calculations).
Also, will Oracle return the sum as the more precise NUMBER (38-digit precision, floating point), or will it round to the second decimal digit, as in NUMBER(9,2), when returning the value?
Will MSSQL behave pretty much the same way (even though the syntax is different)?
Oracle performs the operations in the order you specify.
So, if you write this query:
select SUM(numWithScale/7) from some_table -- (1)
each value is divided by 7 and the quotient is rounded to the maximum available precision: NUMBER with 38 significant digits. After that, all the quotients are summed.
In the case of this query:
select SUM(numWithScale)/7 from some_table -- (2)
all numWithScale values are summed and only then divided by 7. In this case there is no precision loss per record; only the result of dividing the sum by 7 is rounded to 38 significant digits.
This problem is common in numerical algorithms. Each time you divide a value by 7, you introduce a small calculation error because of the limited number of digits used to represent a number:
numWithScale/7 => quotient + delta.
When summing these values you get
sum(quotient) + sum(delta).
If numWithScale follows an ideal uniform distribution and some_table contains an infinite number of records, then sum(delta) tends to zero. But that happens only in theory. In practical cases sum(delta) grows and can introduce a significant error. This is the case for query (1).
On the other hand, summing the stored values cannot introduce a rounding error if implemented properly. So for query (2) a rounding error is introduced only in the last step, when the whole sum is divided by 7. Therefore the value of delta for this query is not affected by the number of records.
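The difference between the two queries can be illustrated with Python's decimal module. This is only an analogy, not a model of Oracle's internals: the 38-digit context is an assumption chosen to mirror the precision mentioned above, and the sample values are made up.

```python
from decimal import Decimal, getcontext

# Assumption: 38 significant digits, to mirror the NUMBER precision above.
getcontext().prec = 38

# Hypothetical salaries with two decimal places, repeated to get 3000 rows.
values = [Decimal("10.12"), Decimal("20.00"), Decimal("19.95")] * 1000

# Like query (1): divide every row first, then sum the rounded quotients.
sum_of_quotients = sum((v / 7 for v in values), Decimal(0))

# Like query (2): sum the exact two-decimal values, then divide once.
quotient_of_sum = sum(values, Decimal(0)) / 7

difference = abs(sum_of_quotients - quotient_of_sum)
```

In this sketch any discrepancy sits far to the right of the decimal point, so both results agree once rounded to two decimal places; the number of rounding steps in the first form still grows with the number of rows, while the second form rounds once.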
Number scale and precision are relevant only as a column or variable constraint.
When you attempt to store a number that exceeds the defined precision, Oracle raises an exception:
create table num (a number(5,2));
insert into num values (123456.789);
=> ORA-01438: value larger than specified precision allowed for this column
When you attempt to store a number that exceeds the defined scale, it is rounded:
insert into num values (123.456789);
select a from num;
=> 123.46
Precision and scale do not matter when you read data and perform any calculations on it...
select 100000 + a / 100 from num;
=> 100001.2346
...unless you want to store the result back into a column with constraints, in which case the above rules apply:
update num set a = a / 100;
select a from num;
=> 1.23
numWithScale/7 will be converted to NUMBER (i.e. it will not be rounded to number(9,2)).

Algorithm to smooth numbers with variable input time

I have an app that accepts integers at a variable rate every .25 to 2 seconds.
I'd like to output the data in a smoothed format for 3, 5 or 7 seconds depending on user input.
If the data always came in at the same rate, let's say every .25 seconds, then this would be easy. The variable rate is what confuses me.
Data might come in like this:
Time - Data
0.25 - 100
0.50 - 102
1.00 - 110
1.25 - 108
2.25 - 107
2.50 - 102
etc.
I'd like to display a 3 second rolling average every .25 seconds on my display.
The simplest form of doing this is to put each item into an array with a time stamp.
array.push([0.25, 100])
array.push([0.50, 102])
array.push([1.00, 110])
array.push([1.25, 108])
etc.
Then every .25 seconds I would read through the array, back to front, until I reached a time earlier than now() - rollingAverageTime. I would average those values and display the result. I would then .Shift() the expired entries off the beginning of the array.
That seems not very efficient though. I was wondering if someone had a better way to do this.
Why don't you save the timestamp of the starting value and then accumulate the values and the number of samples until you get a timestamp that is >= startingTime + rollingAverageTime and then divide the accumulator by the number of samples taken?
EDIT:
If you want to preserve the number of samples, you can do it this way:
Take the accumulator and, for each input value, add the value to it and store the value and its timestamp in a shift register. At every cycle, compare the latest sample's timestamp with the oldest timestamp in the shift register plus the smoothing time; if it is equal or greater, subtract the oldest saved value from the accumulator, delete that entry from the shift register and output the accumulator divided by the smoothing time. Iterating this way gives a rolling average with (I think) the least amount of computation per cycle:
a sum (to increment the accumulator)
a sum and a subtraction (to compare the timestamp)
a subtraction (from the accumulator)
a division (to calculate the average; done in a smart way it can be a right shift)
For a total of about 4 algebraic sums and a division (or shift).
EDIT:
To take into account the time since the last sample as a weighting factor, you can scale each value by the ratio between this time and the averaging time; you then obtain an already weighted average, without having to divide the accumulator.
I added this part because it doesn't add computational load, so you can implement it quite easily if you want to.
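The accumulator-plus-shift-register idea can be sketched as follows. The class name RollingAverage and its methods are assumptions for illustration, and this variant divides the accumulator by the number of samples currently in the window (a plain mean), which is one possible reading of the description above rather than the only one.

```python
from collections import deque

class RollingAverage:
    """Running mean of the samples received in the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.samples = deque()  # (timestamp, value) pairs, oldest on the left
        self.total = 0.0        # accumulator: sum of the values in the deque

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self.total += value
        # Evict samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window:
            _, old_value = self.samples.popleft()
            self.total -= old_value

    def average(self):
        if not self.samples:
            return None
        return self.total / len(self.samples)

data = [(0.25, 100), (0.50, 102), (1.00, 110), (1.25, 108), (2.25, 107), (2.50, 102)]

three_second = RollingAverage(3.0)
one_second = RollingAverage(1.0)
for t, v in data[:4]:
    one_second.add(t, v)
for t, v in data:
    three_second.add(t, v)
```

Each sample is appended once and removed once, so the work per incoming value stays constant on average regardless of how long the window is.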
The answer from clabacchio has the basics right, but perhaps you need a bit more sophisticated answer.
Calculating the average:
0.25 - 100
0.50 - 102
1.00 - 110
In the above subset of the data what is the answer you want? You could use the mean of these numbers or you could do it in a weighted fashion. You could convert the data into:
0.50 - 0.25 = 0.25 ---- (100+102)/2 = 101
1.00 - 0.50 = 0.50 ---- (102+110)/2 = 106
Then you can take the weighted average of these values, weight being the time difference, and value being the average value.
The final answer = (0.25*101 + 0.5*106)/(0.25+0.5) = 78.25/0.75, which is about 104.33.
Now coming to "moving" averages:
You can either use previous k values or previous k seconds worth of data. In both cases you can keep two sums: weighted sum and sum of weights.
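The time-weighted average described above can be sketched like this. The function name time_weighted_average is an assumption for illustration; each pair of consecutive samples contributes the mean of its two values, weighted by the time between them, and the function keeps exactly the two sums mentioned (weighted sum and sum of weights).

```python
def time_weighted_average(samples):
    """samples: list of (timestamp, value) pairs sorted by timestamp."""
    weighted_sum = 0.0
    weight_total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0                        # weight: time between the samples
        weighted_sum += dt * (v0 + v1) / 2  # value: mean of the two readings
        weight_total += dt
    if weight_total == 0:
        return None                         # fewer than two distinct timestamps
    return weighted_sum / weight_total

result = time_weighted_average([(0.25, 100), (0.50, 102), (1.00, 110)])
```

For a moving version, the same two sums can be updated incrementally: add the newest segment's contribution and subtract the contribution of any segment that leaves the window.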
So... the worst case scenario is 4 readings per second over 7 seconds = 28 values in your array to process. That will be done in nanoseconds anyway, so not worth optimizing IMHO.

How to balance the number of items across multiple columns

I need to find out a method to determine how many items should appear per column in a multiple column list to achieve the most visual balance. Here are my criteria:
The list should only be split into multiple columns if the item count is greater than 10.
If multiple columns are required, they should contain no less than 5 (except for the last column in case of a remainder) and no more than 10 items.
If all columns cannot contain an equal number of items:
All but the last column should be equal in number.
The number of items in each column should be optimized to achieve the smallest difference between the last column and the other column(s).
Well, your requirements and your examples appear a bit contradictory. For instance, your second example could be divided into two columns with 11 items in each, and satisfy your criteria. Let's assume that for rule #2 you meant that there should be <= 10 items / column.
In addition, I think you need to add another rule to make the requirements sensible:
The number of columns must not be greater than what is required to accommodate overflow.
Otherwise, you will often end up with degenerate solutions where you have far more columns than you need. For example, in the case of 26 items you probably don't want 13 columns of 2 items each.
If that's the case, here's a simple calculation that should work well and is easy to understand (note that the divisions must be real-valued, not integer, divisions):
int numberOfColumns = CEILING(numberOfItems / 10.0);
int numberOfItemsPerColumn = CEILING((double) numberOfItems / numberOfColumns);
Now you'll create N-1 columns of items (each having numberOfItemsPerColumn items) and the overflow will go in the last column. By this definition, the overflow should be minimized in the last column.
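As a runnable sketch of the calculation above, the following returns the item count for each column. The function name split_into_columns and the max_per_column parameter are assumptions for illustration.

```python
import math

def split_into_columns(number_of_items, max_per_column=10):
    """Return a list with the number of items in each column."""
    if number_of_items <= 0:
        return []
    number_of_columns = math.ceil(number_of_items / max_per_column)
    per_column = math.ceil(number_of_items / number_of_columns)
    # Full columns first, then whatever is left goes into the last column.
    columns = [per_column] * (number_of_items // per_column)
    remainder = number_of_items % per_column
    if remainder:
        columns.append(remainder)
    return columns

layout_22 = split_into_columns(22)
layout_26 = split_into_columns(26)
```

Note that this sketch does not enforce the "no fewer than 5 per column" rule for the last column; it only applies the two ceiling steps described in the answer.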
If you want to automatically determine the appropriate number of columns, and have no restrictions on its limits, I would suggest the following:
Calculate the square root of the total number of items. That would make a square layout.
Divide that number by 1.618, and assign that to the total number of rows.
Multiply that same number by 1.618, and assign that to the total number of columns.
All columns but the right most one will have the same number of items.
By the way, the constant 1.618 is the Golden Ratio. That will achieve a more pleasant layout than a square one.
Divide and multiply the other way round for vertical displays.
Hope this algorithm helps anyone with a similar problem.
Here's what you're trying to solve:
minimize y - z where n = xy + z and 5 <= y <= 10 and 0 <= z <= y
where you have n items split into x full columns of y items and one remainder column of z items.
There is almost certainly a smart way of doing this, but given these constraints a brute force implementation exploring all 6 + 7 + 8 + 9 + 10 + 11 = 51 possible combinations for y and z would take no time at all (only assignments where (n - z) mod y = 0 are solutions).
I think a brute force solution is easy, given the constraint on the number of items per column: let v be the number of items per column (except the last one), then v belongs to [5,10] and can thus take a whopping 6 different values.
Evaluating 6 values is easy enough. Python one-liner (or not far from it) to prove it:
# compute the difference between the number of items for the normal columns
# and for the last column, lesser is better
def helper(n, v):
    modulo = n % v
    if modulo == 0: return 0
    else: return v - modulo

# values can only be in [5,10]
# we compute the difference with the last column for each
# build a list of tuples (difference, - number of items)
# (because the greater the value the better, it means less columns)
# extract the min automatically (in case of equality, less is privileged)
# and then pick the number of items from the tuple and re-inverse it
def compute(n): return -min([(helper(n, v), -v) for v in [5, 6, 7, 8, 9, 10]])[1]
For 77 this yields 7, meaning 7 items per column.
For 22 this yields 8, meaning 8 items per column.

Simulating "Wheel of fortune" (Monte Carlo Simulation Hit or Miss Method)

I'm trying to make a randomizer that will use the Monte Carlo Hit or Miss Simulation.
I have a Key-Value pair that represents the ID and the probability value:
ID - Value
2 - 0.37
1 - 0.35
4 - 0.14
3 - 0.12
When you add all of those values, you will get a total of 1.0.
You can imagine those values as the total area of a "slice" on the "wheel" (EG: ID 2 occupies 37% of the wheel, while ID 3 only occupies 12% of the wheel). When converted to "range" it will look like this:
ID - Value - Range
2 - 0.37 - 0 to 37
1 - 0.35 - 37 to 72
4 - 0.14 - 72 to 86
3 - 0.12 - 86 to 100
Now, I am using Random.NextDouble() to generate a random value that is between 0.0 and 1.0. That random value will be considered as the "spin" on the wheel. Say, the randomizer returns 0.35, then ID 2 will be selected.
What is the best way to implement this given that I have an array of doubles?
The simplest solutions are often the best. If your range is 0 - 100 by design (or another manageably small number), you can allocate an int[] and use the table of ranges you created to fill in the ID at each corresponding index. Your "throw" will then look like:
int randomID = rangesToIDs[random.nextInt(rangesToIDs.length)];
By the way, it is not necessary to sort the IDs by range size: since the random numbers are assumed to be uniformly distributed, it does not matter where in the lookup table a range is placed. It only matters that the number of entries is proportional to the chance of throwing an ID.
Let's assume your initial data is represented as array D[n], where D[i] = (id, p) and sum(D[i].p for i=0..n-1) == 1.
Build a second array P[n] such that P[i] = (q, id): P[i] = (sum(D[j].p for j in 0..i), D[i].id), i.e., convert the individual probability of each slice i into the cumulative probability of all slices up to and including i. Note that, by construction, this array P is ordered by field q (i.e. by cumulative probability).
Now you can use binary search to find the slice chosen by the random number r (0 <= r < 1):
find the lowest i such that r < P[i].q; then P[i].id is your slice.
It is possible to speed up the lookup further by hashing the probability range with a fixed grid. I can write more details on this if anybody is interested.
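The cumulative-probability lookup with binary search can be sketched with Python's bisect module. The names build_wheel and spin are assumptions for illustration; the optional r argument exists only so that the choice can be checked with a fixed number instead of a random one.

```python
import bisect
import itertools
import random

def build_wheel(slices):
    """slices: list of (id, probability) pairs; returns ids and cumulative sums."""
    ids = [slice_id for slice_id, _ in slices]
    cumulative = list(itertools.accumulate(p for _, p in slices))
    return ids, cumulative

def spin(ids, cumulative, r=None):
    if r is None:
        r = random.random()  # uniform in [0, 1)
    # Index of the first cumulative bound strictly greater than r.
    index = bisect.bisect_right(cumulative, r)
    # Guard against the last bound being slightly below 1.0 due to float error.
    return ids[min(index, len(ids) - 1)]

ids, cumulative = build_wheel([(2, 0.37), (1, 0.35), (4, 0.14), (3, 0.12)])
```

For example, spin(ids, cumulative, r=0.35) selects ID 2, matching the case described in the question.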
As jk wrote, a sorted dictionary should be fine.
Let's say you have a dictionary like this:
0.37 2
0.72 1
0.86 4
1.00 3
You roll xx = 0.66.
Iterate through the dictionary starting from the lowest key (that's 0.37):
if xx < dict[i].key
    return dict[i].value
Another solution that comes to mind is a List of custom objects, each containing a lower bound, an upper bound and a value. You then iterate through the list and check whether the rolled number falls between the lower and upper bounds.
A sorted map/dictionary with the 'Value' as the key and the 'ID' as the value would allow you to quickly find the upper bound of the range you are in and then look up the ID for that range.
Assuming your dictionary allows it, a binary search would be better for finding the upper bound than iterating through the entire dictionary.
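import random

boundaries = [37, 72, 86, 100]  # cumulative upper bounds, in percent

def spin():
    num = 100 * random.random()
    for upper in boundaries:
        if num < upper:
            return upper  # upper bound of the selected range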
