Subtract integer from time in Google Spreadsheet - time

Let's say I want to calculate my overtime in Google Spreadsheets. I need the function to either convert an integer (10) to a duration (10:00:00), or vice versa, to be able to perform a calculation.
So,
=A1 - 8
or
=A1 - 8:00:00
should be converted to
10:30:00 - 8:00:00
and return
02:30:00
in order to continue the calculation in another cell.

To convert an integer (or a fraction) representing a number of hours to a duration, just divide it by 24. If you put the value 10.5 in A1, then put this formula in B1
=A1/24-"08:00:00"
and format B1 as Duration, you will get 2:30:00 as the cell value. The "real" underlying value, though, will be 0.1041666667, which corresponds to 2.5/24 of a day.
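If it helps to see the arithmetic outside the spreadsheet, here is a rough Python sketch of the same idea (durations are just fractions of a day); the function names are mine, this is not Sheets code:

# Rough sketch, not Google Sheets code: spreadsheet durations are fractions
# of a day, so hours/24 is the whole trick. Function names are illustrative.
def hours_to_serial(hours):
    """Convert a number of hours (e.g. 10.5) to a day-fraction serial value."""
    return hours / 24.0

def serial_to_hms(serial):
    """Format a day-fraction serial value as H:MM:SS."""
    total_seconds = round(serial * 24 * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

diff = hours_to_serial(10.5) - hours_to_serial(8)   # mirrors =A1/24-"08:00:00"
print(diff)                  # ~0.1041666667, i.e. 2.5/24 of a day
print(serial_to_hms(diff))   # 2:30:00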

Related

Finding the 'Outliers' in numeric data set

I want to compare (sort by) growth rates and disadvantage high rates that have very low starting values.
Example:
1.
Start: 1.000.000
End: 1.100.000
Growth: +10%
2.
Start: 100.000
End: 120.000
Growth: +20%
3.
Start: 1
End: 10
Growth: +900%
4.
Start: 10
End: 15
Growth: +50%
Sorting just by growth, descending, would result in: 900% (3.), 50% (4.), 20% (2.), 10% (1.)
But I want to have: 20% (2.), 10% (1.), 900% (3.), 50% (4.), because in my case the chance is high that 3. and 4. are statistical outliers.
What's the best way to solve this problem, and do I have to define a threshold for the start values?
Thanks!
Based on the description you have provided, the problem can be split into two parts:
Finding and excluding statistical outliers from the data set
Sorting the resulting values in descending (or any) order
The general solution to the first problem, with an example using Microsoft Excel, is described at: Statistical Outliers detection in Microsoft Excel worksheet (http://www.codeproject.com/Tips/214330/Statistical-Outliers-detection). What follows is a bit of theory and a sample pertinent to your case.
Finding "Outliers" in a data set could be done by calculating the deviation for each number, expressed as either a "Z-score" or "modified Z-score" and testing it against certain predefined threshold. Z-score typically refers to number of standard deviation relative to the statistical average (in other words, it's measured in "Sigmas"). Modified Z-score applies the median computation technique to measure the deviation and in many cases provides more robust statistical detection of outliers. Mathematically the Modified Z-score could be written (as suggested by Iglewicz and Hoaglin - see the referenced article) as:
Mi = 0.6745 * (Xi - Median(Xi)) / MAD,
where MAD stands for Median Absolute Deviation. Any number in a data set whose modified Z-score exceeds 3.5 in absolute value is considered an "outlier". The modified Z-score can be used to detect the outliers in a Microsoft Excel worksheet for your case as described below.
Step 1. Open a Microsoft Excel worksheet and in cells A1, A2, A3 and A4 enter the values 900%, 50%, 20% and 10%, respectively.
Step 2. In C1 enter the formula: =MEDIAN(A1:A4) . The value in this cell corresponds to the median of the data set entered in step 1.
Step 3. In C2 enter the array formula: {=MEDIAN(ABS(MEDIAN(A1:A4)-A1:A4))} . As a reminder, to enter an array formula, select the cell, type the formula in the Excel formula bar and then press CTRL-SHIFT-ENTER (notice the curly brackets surrounding the expression, which indicate the array formula). The value in this cell (C2) corresponds to the MAD.
Step 4. Enter the formula: =IF((0.6745*ABS(C$1-A1)>3.5*C$2), "OUTLIER", "NORMAL") in the first row of column B and extend it down to the 4th row. The final result of the outlier detection should appear in column B.
A        B          C
900%     OUTLIER    35%   (median)
50%      NORMAL     0.2   (MAD)
20%      NORMAL
10%      NORMAL
Thus the value 900% is found to be an "outlier" while the other values are OK. Sorting the resulting set is then a trivial task.
The Excel worksheet example is included for clarity of explanation. The algorithm itself can be implemented in any programming language (VBA, C#, Java, etc.). Hope this will help.
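For completeness, here is a rough Python sketch of the same modified Z-score test (names and structure are mine, not from the referenced article); it reproduces the worksheet result on the 900%/50%/20%/10% data:

# Rough Python sketch of the modified Z-score test described above.
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def modified_z_outliers(values, threshold=3.5):
    """Split values into (normal, outliers) using Mi = 0.6745*(Xi - median)/MAD."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    normal, outliers = [], []
    for x in values:
        # Guard against MAD == 0 (all values equal to the median).
        score = 0.6745 * abs(x - med) / mad if mad else 0.0
        (outliers if score > threshold else normal).append(x)
    return normal, outliers

print(modified_z_outliers([9.0, 0.5, 0.2, 0.1]))   # 900%, 50%, 20%, 10% as fractions
# -> ([0.5, 0.2, 0.1], [9.0]): only the 900% growth rate is flagged as an outlier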
My solution:
private static List<double> StatisticalOutLierAnalysis(List<double> allNumbers)
{
    List<double> normalNumbers = new List<double>();
    List<double> outLierNumbers = new List<double>();

    // Plain z-score test: mean and population standard deviation,
    // with anything beyond 2 sigmas treated as an outlier.
    double avg = allNumbers.Average();
    double standardDeviation = Math.Sqrt(allNumbers.Average(v => Math.Pow(v - avg, 2)));

    foreach (double number in allNumbers)
    {
        if (Math.Abs(number - avg) > 2 * standardDeviation)
            outLierNumbers.Add(number);
        else
            normalNumbers.Add(number);
    }

    // Only the non-outliers are returned; outLierNumbers is kept for inspection.
    return normalNumbers;
}

How does ORACLE DB sum NUMBER(*,s) with many records?

I am wondering how Oracle sums NUMBER(9,2) with SUM(numWithScale/7).
This is because I am wondering how the error will propagate with a large number of records.
Let's say I have a table EMP_SAL with columns EMP_ID and numWithScale, where numWithScale is a salary.
To make it simple, let us make the numWithScale column NUMBER(9,2): 9 digits of precision, with 2 decimal places to round to. All of the numbers in the table are random values from 10.00-20.00 (e.g. 10.12, 20.00, 19.95).
I divide by 7 in my calculation to give random digits at the end that round up or down.
Now, I sum all of the employees' salaries with SUM(numWithScale/7).
Will the sum round each time it adds a record, or does Oracle round after the calculation is complete? I.e., the error can be +/-0.01 from rounding, and with many additions followed by roundings the error adds up. Or does it round at the end, so that I don't have to worry about the error adding up (unless I use the result in many more calculations)?
Also, will Oracle return the sum as the more precise NUMBER (38-digit precision, floating point), or will it round to two decimal places, NUMBER(9,2), when returning the value?
Will MSSQL behave pretty much the same way (even though the syntax is different)?
Oracle performs the operations in the order you specified.
So, if you write this query:
select SUM(numWithScale/7) from some_table -- (1)
each value is divided by 7 and rounded to the maximum available precision: a NUMBER with 38 significant digits. After that all the quotients are summed.
In case of this query:
select SUM(numWithScale)/7 from some_table -- (2)
all numWithScale values are summed, and only after that is the sum divided by 7. In this case there is no precision loss for each record; only the result of dividing the sum by 7 is rounded to 38 significant digits.
This problem is common for calculation algorithms. Each time you divide a value by 7 you produce a small calculation error because of the limited number of digits representing the number:
numWithScale/7 => quotient + delta.
While summing these values you get
sum(quotient) + sum(delta).
If numWithScale followed an ideal uniform distribution and some_table contained an infinite number of records, then sum(delta) would tend to zero. But that happens only in theory. In practical cases sum(delta) grows and introduces a significant error. This is the case for query (1).
On the other hand, summing can't introduce a rounding error if implemented properly. So for query (2) the rounding error is introduced only in the last step, when the whole sum is divided by 7. Therefore the value of delta for this query is not affected by the number of records.
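To see the effect outside Oracle, here is a rough Python sketch using the decimal module; the precision is deliberately set to 6 (instead of Oracle's 38) so the difference between the two query shapes shows up in the last digit:

# Rough sketch with Python's decimal module, not Oracle. The precision is set
# to 6 significant digits instead of Oracle's 38 so the effect is visible.
from decimal import Decimal, getcontext

getcontext().prec = 6

salaries = [Decimal("10.12"), Decimal("19.95"), Decimal("15.37")]

sum_of_quotients = sum(s / 7 for s in salaries)   # like SUM(numWithScale/7): each quotient rounded
quotient_of_sum = sum(salaries) / 7               # like SUM(numWithScale)/7: one rounding at the end

print(sum_of_quotients)   # 6.49142
print(quotient_of_sum)    # 6.49143 (closer to the exact 6.4914285714...)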
Number scale and precision are only relevant as a column or variable constraint.
When you attempt to store a number that exceeds the defined precision, it will raise an exception:
create table num (a number(5,2));
insert into num values (123456.789);
=> ORA-01438: value larger than specified precision allowed for this column
When you attempt to store a number that exceeds the defined scale, it will be rounded:
insert into num values (123.456789);
select a from num;
=> 123.46
Precision and scale do not matter when you read data and perform any calculations on it...
select 100000 + a / 100 from num;
=> 100001.2346
...unless you want to store it back into a column with constraints, in which case the above rules apply:
update num set a = a / 100;
select a from num;
=> 1.23
numWithScale/7 will be converted to NUMBER (i.e. it will not be rounded to number(9,2)).

Algorithm to smooth numbers with variable input time

I have an app that accepts integers at a variable rate every .25 to 2 seconds.
I'd like to output the data in a smoothed format for 3, 5 or 7 seconds depending on user input.
If the data always came in at the same rate, let's say every .25 seconds, then this would be easy. The variable rate is what confuses me.
Data might come in like this:
Time - Data
0.25 - 100
0.50 - 102
1.00 - 110
1.25 - 108
2.25 - 107
2.50 - 102
etc...
I'd like to display a 3 second rolling average every .25 seconds on my display.
The simplest form of doing this is to put each item into an array with a time stamp.
array.push([0.25, 100])
array.push([0.50, 102])
array.push([1.00, 110])
array.push([1.25, 108])
etc...
Then every .25 seconds I would read through the array, back to front, until I got to a time that was less than now() - rollingAverageTime. I would average those values and display the result. I would then shift() the old items off the beginning of the array.
That seems not very efficient though. I was wondering if someone had a better way to do this.
Why don't you save the timestamp of the starting value and then accumulate the values and the number of samples until you get a timestamp that is >= startingTime + rollingAverageTime and then divide the accumulator by the number of samples taken?
EDIT:
If you want to preserve the number of samples, you can do it this way:
Take an accumulator and, for each input value, add it to the accumulator and store the value and its timestamp in a shift register. At every cycle, compare the latest sample's timestamp with the oldest timestamp in the shift register plus the smoothing time; if it's equal or greater, subtract the oldest saved value from the accumulator, delete that entry from the shift register and output the accumulator divided by the smoothing time. If you iterate, you obtain a rolling average with (I think) the least amount of computation per cycle:
a sum (to increment the accumulator)
a sum and a subtraction (to compare the timestamp)
a subtraction (from the accumulator)
a division (to calculate the average; done in a smart way, it can be a shift right)
For a total of about 4 algebraic sums and a division (or shift).
EDIT:
To take the time since the last sample into account as a weighting factor, you can divide each value by the ratio between this time and the averaging time; you then obtain an already-weighted average without having to divide the accumulator.
I added this part because it doesn't add computational load, so you can implement it quite easily if you want to.
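A rough Python sketch of this accumulator + shift-register idea (names are mine; I divide by the number of samples currently in the window to get a plain average, rather than by the smoothing time, and I leave out the weighting from the second edit):

# Rough Python sketch of the accumulator + shift-register idea (names are mine).
from collections import deque

class RollingAverage:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()   # the "shift register": (timestamp, value) pairs
        self.acc = 0.0           # running sum of the values currently in the window

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self.acc += value
        # Drop samples that are older than the smoothing window.
        while self.samples and timestamp - self.samples[0][0] >= self.window:
            _, old_value = self.samples.popleft()
            self.acc -= old_value
        return self.acc / len(self.samples)   # plain average of the samples left

avg = RollingAverage(3.0)
for t, v in [(0.25, 100), (0.50, 102), (1.00, 110), (1.25, 108), (2.25, 107), (2.50, 102)]:
    print(t, avg.add(t, v))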
The answer from clabacchio has the basics right, but perhaps you need a slightly more sophisticated answer.
Calculating the average:
0.25 - 100
0.50 - 102
1.00 - 110
In the above subset of the data what is the answer you want? You could use the mean of these numbers or you could do it in a weighted fashion. You could convert the data into:
0.50 - 0.25 = 0.25 ---- (100+102)/2 = 101
1.00 - 0.50 = 0.50 ---- (102+110)/2 = 106
Then you can take the weighted average of these values, weight being the time difference, and value being the average value.
The final answer = (0.25*101 + 0.5*106)/(0.25+0.5) = 78.25/0.75 ≈ 104.33.
Now coming to "moving" averages:
You can either use the previous k values or the previous k seconds' worth of data. In both cases you can keep two running sums: the weighted sum and the sum of the weights.
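Here is a rough Python sketch combining the two ideas above: midpoint values weighted by the time gaps, with the two running sums kept up to date as old segments fall out of the window (names are mine):

# Rough sketch: segment values are midpoints weighted by the time gap, and two
# running sums (weighted sum, sum of weights) give the rolling average.
from collections import deque

def weighted_rolling_average(samples, window_seconds):
    """samples: iterable of (timestamp, value) pairs; yields (timestamp, average)."""
    segments = deque()        # (end_time, midpoint_value, weight) inside the window
    weighted_sum = 0.0
    weight_sum = 0.0
    prev = None
    for t, v in samples:
        if prev is not None:
            w = t - prev[0]                  # weight = time gap between samples
            mid = (prev[1] + v) / 2.0        # value = midpoint of the two samples
            segments.append((t, mid, w))
            weighted_sum += mid * w
            weight_sum += w
        prev = (t, v)
        # Evict segments whose end time has fallen out of the window.
        while segments and t - segments[0][0] > window_seconds:
            _, old_mid, old_w = segments.popleft()
            weighted_sum -= old_mid * old_w
            weight_sum -= old_w
        if weight_sum > 0:
            yield t, weighted_sum / weight_sum

data = [(0.25, 100), (0.50, 102), (1.00, 110), (1.25, 108), (2.25, 107), (2.50, 102)]
for t, avg in weighted_rolling_average(data, 3.0):
    print(t, round(avg, 2))   # at t=1.00 this prints ~104.33, matching the worked example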
So... the worst case scenario is 4 readings per second over 7 seconds = 28 values in your array to process. That will be done in nanoseconds anyway, so not worth optimizing IMHO.

How to decide on weights?

For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per given date (the sum of all weights = 1).
The basic idea is that the date closest to today should receive the highest weight, the second closest date should get the second highest weight, and so on...
Any ideas?
Thanks in advance!
First, for each date in your input set compute the amount of time between that date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, invert the values so that their relative order is reversed (closer dates end up with larger values). The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to invert the values. Think about what the nature of the weights should be: do you want noticeable differences between two far-away dates, or should two far-away dates have roughly equal weights? Do you want a date that is very close to today to have an extremely large weight, or only a moderately larger one?
Note that you should come up with an inverting procedure that cannot divide by zero. In the example above, val[i] is zero for today's date, so dividing by it results in division by zero. One method to avoid division by zero is called smoothing. The most trivial way to "smooth" your data is add-one smoothing, where you just add one to each value (so today becomes 1, tomorrow becomes 2, a week from today becomes 8, etc.).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i
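A rough Python sketch of this recipe, using add-one smoothing and 1/val inversion (function and variable names are mine):

# Rough sketch: day distances, add-one smoothing, 1/val inversion, normalization.
from datetime import date

def date_weights(dates, today=None):
    today = today or date.today()
    vals = [abs((today - d).days) + 1 for d in dates]   # add-one smoothing avoids 1/0
    inv = [1.0 / v for v in vals]                        # closer dates -> larger values
    total = sum(inv)
    return [w / total for w in inv]                      # normalize so the weights sum to 1

dates = [date(2024, 1, 1), date(2024, 1, 7), date(2024, 1, 8)]
print(date_weights(dates, today=date(2024, 1, 8)))
# the most recent date gets the largest weight; the weights sum to 1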
Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10 or whatever you need - these values can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter
# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))
weights, stepping = [], 10
for date in sorted(past_dates):
    weights.append( (date, stepping) )
    stepping += 10
sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]
pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]
How to weight: just compute the difference between each date and the current date
x(i) = abs(date(i) - current_date)
you can then use different expressions to assign weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2)
use a Gaussian distribution - more complicated, not recommended
Then use normalized weights: w(i)/sum(w(i)) so that the sum is 1.
(Note that the exponential function is commonly used by statisticians in survival analysis.)
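A rough Python sketch of the exponential variant (the decay parameter is my own illustrative addition, not part of the answer above):

# Rough sketch of w(i) = exp(-x(i)), then normalization; exp(-x) stays positive
# for typical date distances, so the 1/x division-by-zero issue does not arise.
import math

def exp_weights(day_distances, decay=1.0):
    raw = [math.exp(-decay * x) for x in day_distances]
    total = sum(raw)
    return [w / total for w in raw]

print(exp_weights([0, 1, 7]))   # today dominates; the weights still sum to 1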
The first thing that comes to my mind is to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2
2 days ago would be 1/4
and so on
i is the index of the i-th date.
D0 is the first date.
Ni is the difference in days between the i-th date and the first date D0.
D is the normalization factor (to make the weights sum to 1, D would be the sum of all the Ni).
Assign weights equal to Ni / D.
Convert the dates to yyyymmddhhmiss format (24-hour clock), add up all these values to get a total, then divide each value by the total and sort by that value.
declare @Data table
(
    Date bigint,
    Weight float
)
declare @sumTotal decimal(18,2)

insert into @Data (Date)
select top 100
    replace(replace(replace(convert(varchar, Datetime, 20), '-', ''), ':', ''), ' ', '')
from Dates

select @sumTotal = sum(Date)
from @Data

update @Data set
    Weight = Date / @sumTotal

select * from @Data order by 2 desc

How can an Oracle NUMBER have a Scale larger than the Precision?

The documentation states: "Precision can range from 1 to 38. Scale can range from -84 to 127".
How can the scale be larger than the precision? Shouldn't the Scale range from -38 to 38?
The question could be: why not?
Try the following SQL.
select cast(0.0001 as number(2,5)) num,
to_char(cast(0.0001 as number(2,5))) cnum,
dump(cast(0.0001 as number(2,5))) dmp
from dual
What you see is that you can hold small numbers in that sort of structure.
It might not be required very often, but I'm sure somewhere there is someone who is storing very precise but very small numbers.
According to Oracle Documentation:
Scale can be greater than precision, most commonly when e notation is used. When scale is greater than precision, the precision specifies the maximum number of significant digits to the right of the decimal point. For example, a column defined as NUMBER(4,5) requires a zero for the first digit after the decimal point and rounds all values past the fifth digit after the decimal point.
Here's how I see it:
When Precision is greater than Scale (e.g. NUMBER(8,5)), there is no problem; this is straightforward. The number will have a total of 8 digits, 5 of which are in the fractional part, so the integer part will have 3 digits. This is easy.
When Precision is smaller than Scale (e.g. NUMBER(2,5)), it means three things:
The number will not have an integer part, only a fractional part. The 0 before the decimal point is not counted; think of it as .12345, not 0.12345. In fact, if you supply even a single non-zero digit in the integer part, you will always get an error.
The Scale is the total number of digits the fractional part can have, 5 in this case. So a value like .00098 fits, but nothing can extend past 5 digits after the decimal point.
The fractional part is divided into two parts: significant digits and leading zeros. The number of significant digits is given by the Precision, and the minimum number of leading zeros equals (Scale - Precision). Example:
Here the number must have a minimum of 3 zeros at the start of the fractional part, followed by 2 significant digits (which may contain a zero as well). So 3 zeros + 2 significant digits = 5, which is the Scale.
In brief, when you see, for example, NUMBER(6,9), it tells you that the fractional part will have 9 digits in total, starting with an obligatory 3 zeros and followed by up to 6 significant digits.
Here are some examples :
SELECT CAST(.0000123 AS NUMBER(6,9)) FROM dual; -- prints: 0.0000123; .000|012300
SELECT CAST(.000012345 AS NUMBER(6,9)) FROM dual; -- prints: 0.0000123; .000|012345
SELECT CAST(.123456 AS NUMBER(3,4)) FROM dual; -- ERROR! must have at least 1 leading zero (4-3=1)
SELECT CAST(.013579 AS NUMBER(3,4)) FROM dual; -- prints: 0.0136; max 4 digits, .013579 rounded to .0136
Thanks to everyone for the answers. It looks like the precision is the number of significant digits.
select cast(0.000123 as number(2,5)) from dual
results in:
.00012
Whereas
select cast(0.00123 as number(2,5)) from dual
and
select cast(0.000999 as number(2,5)) from dual
both result in:
ORA-01438: value larger than specified precision allowed for this column
the 2nd one due to rounding.
According to Oracle Documentation:
Scale can be greater than precision, most commonly when e notation is used. When scale is greater than precision, the precision specifies the maximum number of significant digits to the right of the decimal point. For example, a column defined as NUMBER(4,5) requires a zero for the first digit after the decimal point and rounds all values past the fifth digit after the decimal point.
It is good practice to specify the scale and precision of a fixed-point number column for extra integrity checking on input. Specifying scale and precision does not force all values to a fixed length. If a value exceeds the precision, then Oracle returns an error. If a value exceeds the scale, then Oracle rounds it.
The case where Scale is larger than Precision could be summarized this way:
Number of digits on the right of decimal point = Scale
Minimum number of zeroes right of decimal = Scale - Precision
--this will work
select cast(0.123456 as number(5,5)) from dual;
returns 0.12346
-- but this
select cast(0.123456 as number(2,5)) from dual;
--will return "ORA-1438 value too large".
--It will not return an error when there are at least 5-2 = 3 zeroes:
select cast(0.000123456 as number(2,5)) from dual;
returns 0.00012
-- and of course this will work too
select cast(0.0000123456 as number(2,5)) from dual;
returning 0.00001
Hmm, as I understand the reference, the precision is the count of digits.
maximum precision of 126 binary digits, which is roughly equivalent to 38 decimal digits
In Oracle you have the type NUMBER(precision,scale), where precision is the total number of digits and scale is the number of digits to the right of the decimal point. Scale can be omitted, in which case it defaults to zero. Precision can be unspecified (e.g. NUMBER(*,10)); this means the total number of digits is whatever is needed, but there are 10 digits to the right of the decimal point.
If the scale is less than zero, the value will be rounded to that many digits to the left of the decimal point.
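As a loose analogy only (this is Python, not Oracle): round() with a negative second argument rounds to the left of the decimal point, which is what a negative scale does.

# Loose analogy in Python, not Oracle: a negative second argument to round()
# rounds to the left of the decimal point, like a negative scale.
print(round(123456.789, -2))   # 123500.0 (rounded to hundreds)
print(round(123456.789, 2))    # 123456.79 (a positive "scale" rounds to the right)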
I think that if you reserve more digits to the right of the decimal point than the number can hold in total, it means something like 0.00000000123456, but I am not 100% sure.
