Moving average conditioned on ID in Google Sheets - google-sheets-formula

I have a Google Sheet in which I have to calculate a moving average, conditioned on the 'ID', that averages the values of the last 3 periods.
Any idea on how to do it?
I leave an example with the expected results (column "Mean Average (last 3)").
Regards!
ID  value  Mean Average (last 3)
1   12     12,00
1   19     12,00
1   19     15,50
1   18     16,67
1   13     18,67
2   11     11,00
2   18     11,00
2   15     14,50
2   17     14,67
2   11     16,67
3   11     11,00
3   16     11,00
3   10     13,50
3   11     12,33

I've got an answer that may work for you. Assuming that your sample data is in A4:C (see my sample sheet), try the following formula in column D, in the same row as your data headers.
={"Mean Avg";ArrayFormula(
IF(ROW(A4:A18)<ROW(A$4)+2,
C$4,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-1,0))),
B4:B19,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-2,0))),
B3:B18,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-3,0))),
(B2:B17+B3:B18)/2,
(B1:B16+B2:B17+B3:B18)/3)))))}
The first IF checks whether this is one of the first two data rows, to force the initial values.
The next IF checks whether the ID differs from the row above, which forces the start of a new average with just the current value. The next IF checks whether this is the second row in a series (the ID two rows up differs); if so, it also uses just the single value from the row above.
The next IF checks three rows up; if the IDs differ there, this is the third row of a series, and it averages the values from the two rows above.
Otherwise, this is at least the fourth data row in a series with the same ID, and the formula averages the values from the three rows above.
Due to the offsets, the formula is quite sensitive to the ranges, so it may need some tuning if you move it.
Let me know if this helps.
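For reference, here is the same per-ID "average of the previous up-to-3 values" logic sketched outside Sheets, in Python with pandas (the frame and column names are my own, taken from the example data); it reproduces the "Mean Average (last 3)" column:
import pandas as pd

# Example data from the question (decimal points instead of commas).
df = pd.DataFrame({
    "id":    [1]*5 + [2]*5 + [3]*4,
    "value": [12, 19, 19, 18, 13, 11, 18, 15, 17, 11, 11, 16, 10, 11],
})

# Per ID: mean of the previous up-to-3 values; the first row of each ID
# has no history, so it falls back to its own value via fillna().
df["mean_last3"] = df.groupby("id")["value"].transform(
    lambda s: s.shift().rolling(3, min_periods=1).mean().fillna(s)
)
print(df.round(2))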

Related

Create values of new data frame variable based on other column values

I have a question about data set preparation. In a survey, the same people were asked about a number of different variables at two points of measurement. This resulted in a dataset in long format, i.e. the information from each participant is stored in two rows, each representing that person's data at the respective time of measurement (see example). Each individual has a unique participation code, so the same participation code indicates that the data comes from the same person.
code  time  risk_perception
DB6M  1     6
DB6M  2     4
TH4D  1     2
TH4D  2     3
Now I would like to create a new variable "risk_perception.complete" that shows whether the information for each participant is complete. It could be that a person gave no information at either measurement time, or only at one of the two, so values are missing (NAs). In the new variable I would like to check and code this information for each person: if the person has one or more NAs, a 0 should be coded there; if the person has no NAs, a 1 (see example).
code  time  risk_perception  risk_perception.complete
DB6M  1     6                1
DB6M  2     4                1
TH4D  1     2                1
TH4D  2     3                1
SU6H  1     NA               0
SU6H  2     3                0
VG9S  1     NA               0
VG9S  2     NA               0
Can anyone tell me the best way to program this?
Here is a reproducible example:
data <- data.frame(
code = c("AH6M","AH6M","BD7M","BD7M","SH9L","SH9L"),
time = c(1,2,1,2,1,2),
risk = c(6,7,NA,3,NA,NA))
Thank you in advance and best regards!
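The logic the question describes - flag a person with 0 if any of their rows has an NA, and 1 otherwise - is a per-group "all values present" check. A minimal sketch of that check in Python with pandas, mirroring the reproducible example above (in R, ave() grouped by code with anyNA() expresses the same idea):
import pandas as pd

# Mirrors the question's reproducible example (NA -> None).
data = pd.DataFrame({
    "code": ["AH6M", "AH6M", "BD7M", "BD7M", "SH9L", "SH9L"],
    "time": [1, 2, 1, 2, 1, 2],
    "risk": [6, 7, None, 3, None, None],
})

# For each participation code: 1 if no risk value is missing, else 0.
# transform() broadcasts the per-group result back onto every row.
data["risk.complete"] = data.groupby("code")["risk"].transform(
    lambda s: int(s.notna().all())
)
print(data)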

SQL Server Reporting: How to calculate a value based on the previous calculated value in the same column?

I'm trying to calculate a row value based on the previous row's value in the same column, within a report expression. I can't precalculate this in the database, since the starting point of the calculation depends on input parameters, and the values in the table should be recalculated dynamically within the report itself.
In Excel, the analogous data and formulas look as shown below (the starting point is always 100):
     B      C              D          E
     Price  PreviousPrice  CalcValue  Formula
1    NULL   NULL           100
2    2.6    2.5            104        B2/C2*D1
3    2.55   2.6            102        B3/C3*D2
4    2.6    2.55           104        B4/C4*D3
5    2.625  2.6            105        B5/C5*D4
6    2.65   2.625          106        B6/C6*D5
7    2.675  2.65           107        B7/C7*D6
I tried to calculate the expected values ("CalcValue" is the name of the column where the expression is set) like this:
=Fields!Price.Value / Fields!PreviousPrice.Value * Previous(ReportItems("CalcValue").Value)
but got the error "Aggregate functions can be used only on report items contained in page headers and footers".
Can you please advise whether the expected result is achievable in my case and suggest a solution?
Thank you in advance!
Sadly, I'm still facing the issue: the calculated column does not take the previous calculated value into account. E.g., I added a CalcVal field with 100 as the default and tried to calculate it using the approach above, like:
=Previous(RunningValue(Fields!CalcVal.Value, Sum, "DataSet1")) * Fields!Price.Value / Fields!PreviousPrice.Value
But in this case it always multiplies Fields!Price.Value/Fields!PreviousPrice.Value by 100.
For example, "CalcVal on the fly" always shows 200 with:
=Previous(RunningValue(Fields!CalcVal.Value, Sum, "DataSet1")) * 2
https://imgur.com/Wtg3Wsg
I tried with your sample data; here is how I achieved the results.
The formula to use (you might have to take care of NULL values):
=Fields!Price.Value / Fields!PreviousPrice.Value * Previous(Fields!CalcValue.Value)
Edit: update to the answer after the OP's comment.
CalcValue is calculated on the fly with the formula below:
=RunningValue(CountDistinct("Tablix6"), Count, "Tablix6") * 100
and then the final value as below:
=Fields!Price.Value / Fields!PreviousPrice.Value *
Previous(RunningValue(CountDistinct("Tablix6"), Count, "Tablix6")) * 100
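A note on why the self-referencing expression keeps failing: each row is just the previous row times Price/PreviousPrice, so the recurrence collapses to CalcValue[n] = 100 times the running product of the ratios, and no reference to the previously rendered cell is needed at all. Assuming SSRS lets you call the VB Math functions inside expressions (worth verifying), that product can be built as a running sum of logarithms, e.g. =100*Math.Exp(RunningValue(Math.Log(Fields!Price.Value/Fields!PreviousPrice.Value), Sum, "DataSet1")), guarding the first row's NULLs with IIf. A quick Python sketch checking the closed form against the Excel column above:
# The recurrence CalcValue[n] = CalcValue[n-1] * Price[n]/PreviousPrice[n],
# with CalcValue[1] = 100, is just 100 times a running product of the ratios.
prices      = [2.6, 2.55, 2.6, 2.625, 2.65, 2.675]   # rows 2..7, column B
prev_prices = [2.5, 2.6, 2.55, 2.6, 2.625, 2.65]     # rows 2..7, column C

calc = [100.0]                       # row 1 starting point
for p, pp in zip(prices, prev_prices):
    calc.append(calc[-1] * p / pp)   # same as B/C * previous D

print([round(v) for v in calc])      # [100, 104, 102, 104, 105, 106, 107]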

How to make sure I never get ORA-01438: value larger than specified precision allowed for this column?

I'm doing a division for each record and updating a certain column with the result, so my SQL looks something like this:
update table1 set frequency = num/denom where id>XXX
The frequency column's data type is NUMBER(10,10), per https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT1838.
First, I'm not even sure why I get this error, because the result will always be 0.XXX, so I figured 10 digits before the decimal point would be plenty; and the 10 digits after it should be okay too, because the value would just be truncated if the result were longer.
NUMBER(10, 10) means 10 digits in total and a scale of 10.
That means all 10 digits are to the right of the decimal point, which leaves no digit to the left of it.
So with the table
CREATE TABLE t
(
test NUMBER (10, 10)
);
insert into t values (0.9999999999); will work, while
insert into t values (0.99999999999); will fail, because the value is rounded up to 1.
So if num/denom is 1 or larger, you will get ORA-01438: value larger than specified precision allowed for this column.
But you will also get this error if num/denom is 0.99999999995 or larger, because Oracle rounds it to 1.
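That boundary is easy to check outside the database; here is a quick sketch with Python's decimal module (exact decimal arithmetic, so no binary floating-point noise), mirroring Oracle's round-on-insert versus an explicit TRUNC:
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

q = Decimal("0.99999999995")

# What Oracle does on insert into NUMBER(10,10): round to 10 decimal places.
print(q.quantize(Decimal("1e-10"), rounding=ROUND_HALF_UP))  # 1.0000000000 -> ORA-01438
# What TRUNC(num/denom, 10) does instead: cut to 10 decimal places.
print(q.quantize(Decimal("1e-10"), rounding=ROUND_DOWN))     # 0.9999999999 -> fits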
First of all, let me clear up the confusion around precision and scale. The documentation states:
For numeric columns, you can specify the column as:
column_name NUMBER
Optionally, you can also specify a precision
(total number of digits) and scale (number of digits to the right of
the decimal point):
column_name NUMBER (precision, scale)
In your case:
frequency NUMBER(10,10)
This means that the total number of digits is 10, and all 10 of them are to the right of the decimal point. The column can therefore only hold values strictly between -1 and 1, from 0.0000000001 (9 zeroes and a 1 at the end) up to 0.9999999999 in magnitude; there is no room for any digit to the left of the decimal point.
Now that we know this, let's proceed to the problem.
You need this query to never fail with ORA-01438:
update table1 set frequency = num/denom where id>XXX;
You can truncate the result at update time, so that Oracle never rounds it past the column's capacity:
update table1
set frequency = TRUNC(num/denom, 10)
where id>XXX;
TRUNC(num/denom, 10) cuts the quotient to 10 decimal places instead of letting Oracle round it, so a value such as 0.99999999995 is stored as 0.9999999999 rather than being rounded up to 1.
*Note: TRUNC discards the extra digits rather than rounding, so the stored value can be very slightly smaller than the exact quotient. Also, if num/denom itself is 1 or larger (for example, when num >= denom), no truncation of decimal places will make it fit in a NUMBER(10,10) column; in that case you would have to widen the column (e.g. to NUMBER(11,10)) or constrain the data.
Cheers

MapReduce: extract the one row with the highest value

This is the result (actual output) from the reducer. Each line is a title (the key), followed by the month(s) and the frequency of how many times the book was borrowed (the value). Is there any way to get only the one row with the highest value? For example, I want to choose only the row with the highest frequency among lots of rows. If you know a way, please enlighten me. Thanks a lot.
"""E"" is for evidence [sound recording] / by Sue Grafton." 05 8
"""F"" is for fugitive [sound recording] / by Sue Grafton." 05 6
"""G"" is for Grafton : the world of Kinsey Millhone / Natalie Hevener Kaufman and Carol McGinnis Kay." 06 1
"""G"" is for gumshoe [text (large print)] / Sue Grafton." 09,10 1
"""Galapagos"" means ""tortoises"" / written and illustrated by Ruth Heller." 10,04,09 2
"""Git on board 09 1
"""God's banker"" / by Rupert Cornwell." 05,10,11 1
"""Gospodi-- spasi i usmiri Rossi︠i︡u"" : Nikolaĭ II 10,11 1
"""H"" is for homicide [sound recording] / by Sue Grafton." 12 4
Run a secondary MapReduce job that accepts the output of the first job as input. In the mapper, write (NullWritable, line), since you want to collect all lines at a single reducer and don't otherwise care about the key. In the reducer, parse the number out of each line, keeping track of the current maximum value and its associated line; after looping over all values, write out the maximum line.
To improve the run time of this process, use setCombinerClass in your job configuration so this new reducer also runs as a combiner.
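As a minimal sketch of that second pass (written here as a Hadoop Streaming-style reducer in Python rather than Java, and assuming the frequency is the last whitespace-separated field of each line):
import sys

best_line, best_freq = None, -1
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    # The frequency is assumed to be the last whitespace-separated field.
    freq = int(line.rsplit(None, 1)[-1])
    if freq > best_freq:
        best_freq, best_line = freq, line

# Emit only the single line with the highest frequency.
if best_line is not None:
    print(best_line)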

Bash: find all pairs of lines such that the difference of their first fields is less than a threshold

My problem is the following. I have a BIG file with many rows containing ordered numbers (repetitions are possible):
1
1.5
3
3.5
6
6
...
1504054
1504056
I would like to print all the pairs of row numbers such that their difference is smaller than a given threshold thr. Let us say, for instance, thr=2.01; then I want
0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N
I wrote a thing in Python, but the file is huge and I think I need a smarter way to do this in bash.
Actually, the complete data structure also contains a second column with a string:
1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN
and, if it is easy to do, I would like each output row to contain the pair of linked strings, separated by "|":
s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN
Thanks for your help; I am not too familiar with bash.
In any language you can write a program implementing this idea; here is the pseudocode made into runnable Python (reading the file from standard input):
import sys
thr = 2.01
kept_rows = []  # recent rows that can still pair with an upcoming row
for i, line in enumerate(sys.stdin):
    fields = line.split()
    if not fields:
        continue
    value = float(fields[0])
    label = fields[1] if len(fields) > 1 else str(i)  # fall back to the row number
    still_close = []
    for prev_value, prev_label in kept_rows:
        if value - prev_value <= thr:  # input is sorted, so no abs() needed
            print(prev_label + "|" + label)
            still_close.append((prev_value, prev_label))
    still_close.append((value, label))
    kept_rows = still_close
This program only keeps the few recent lines that can still match the condition; all others are freed from memory, so the memory footprint should remain small even for big files.
I would use awk because I'm comfortable with it, but Python fits well too, as above.
