For simplicity's sake, I have the following dummy data:
id val
1 5
1 30
1 50
1 15
2 120
2 60
2 10
2 10
My desired output is the following:
id SUM_GT_10%
1 95%
2 90%
SUM_GT_10% can be obtained by the following steps:
Calculate the sum of val for each id
Divide each val by the sum from step 1
Sum the step 2 results where the step 2 value is greater than 10%
Using the example data, the sum of val is 100 for id 1 and 200 for id 2, so we would obtain the following additional columns (step 1 and step 2):
id val step1 step2
1 5 100 5%
1 30 100 30%
1 50 100 50%
1 15 100 15%
2 120 200 60%
2 60 200 30%
2 10 200 5%
2 10 200 5%
And our final output (step 3) would be the sum of the step 2 values where step 2 > 10%:
id SUM_GT_10%
1 95%
2 90%
I don't care about the intermediate columns, just the final output, of course.
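For reference, here is the same three-step logic as a rough pandas sketch (only to make the logic concrete; it is not the Power BI / DAX solution, and the column names are made up):

import pandas as pd

# Dummy data from above.
df = pd.DataFrame({"id":  [1, 1, 1, 1, 2, 2, 2, 2],
                   "val": [5, 30, 50, 15, 120, 60, 10, 10]})

df["sum_per_id"] = df.groupby("id")["val"].transform("sum")  # step 1: sum of val per id
df["pct"] = df["val"] / df["sum_per_id"]                      # step 2: val divided by that sum
result = (df[df["pct"] > 0.1]                                 # step 3: keep only values > 10%
          .groupby("id")["pct"]
          .sum())
print(result)  # id 1 -> 0.95, id 2 -> 0.90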
James, you might want to create a temporary table in your measure and then sum its results:
tbl_SumVAL =
var ThisId = MAX(tbl_VAL[id])
var temp =
    FILTER(
        SELECTCOLUMNS(
            tbl_VAL,
            "id", tbl_VAL[id],
            "GT%", tbl_VAL[val] / SUMX(FILTER(tbl_VAL, tbl_VAL[id] = ThisId), tbl_VAL[val])
        ),
        [GT%] > 0.1
    )
return
    SUMX(temp, [GT%])
The temp table is basically recreating the first two steps you described (dividing each "original" value by the sum of all values for its ID) and then keeping only those values that are greater than 0.1. Note that if your id is not a number, then you'd need to replace MAX(tbl_VAL[id]) with SELECTEDVALUE(tbl_VAL[id]).
The final result matches the desired output above.
Also, make sure to set your id field to "Don't summarize", in case id is a number.
Context:
Each activity has a grade
Activities belong to a subject, and subject_avg is simply the average of its activities' grades in a given time range
global_avg is the average of many subject_avg values (i.e., not to be confused with the average of all activity grades)
Problem:
"Efficiently" calculate global_avg in variable time windows
"Efficiently" calculating subject_avg for a single subject, by accumulating the amount and grade of its activities:
        date    grade
act1    day 1   0.5
act2    day 3   1
act3    day 3   0.8
act4    day 6   0.6
act5    day 6   0
        avg_sum   activity_count
day 1   0.5       1
day 3   2.3       3
day 6   2.6       5
I called it "efficiently" because if I need subject_avg between any 2 dates, I can obtain it with simple arithmetic over the second table:
subject_avg (day 2 to 5) = (2.3 - 0.5) / (3 - 1) = 0.9
Calculating global_avg:
subjectA:
        avg_sum   activity_count
day 1   0.5       1
day 3   2.3       3
day 6   2.6       5
subjectB:
        avg_sum   activity_count
day 4   0.8       1
day 6   1.8       2
global_avg (day 2 to 5) = (subjectA_avg + subjectB_avg) / 2 = (0.9 + 0.8) / 2 = 0.85
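To make the arithmetic concrete, here is a small Python sketch of the idea (my own illustration, with the checkpoints hard-coded from the tables above): each subject keeps cumulative (avg_sum, activity_count) checkpoints, and any window average is a difference of two checkpoints.

from bisect import bisect_left, bisect_right

# Cumulative checkpoints per subject, taken from the tables above.
subject_a = {"days": [1, 3, 6], "grade_sum": [0.5, 2.3, 2.6], "count": [1, 3, 5]}
subject_b = {"days": [4, 6], "grade_sum": [0.8, 1.8], "count": [1, 2]}

def subject_avg(subject, start_day, end_day):
    """Average grade of the subject's activities with start_day <= day <= end_day."""
    days = subject["days"]
    hi = bisect_right(days, end_day) - 1    # last checkpoint inside the window
    lo = bisect_left(days, start_day) - 1   # last checkpoint before the window
    if hi < 0 or hi == lo:
        return None                         # no activities in the window
    grades = subject["grade_sum"][hi] - (subject["grade_sum"][lo] if lo >= 0 else 0)
    count = subject["count"][hi] - (subject["count"][lo] if lo >= 0 else 0)
    return grades / count

averages = [a for a in (subject_avg(s, 2, 5) for s in (subject_a, subject_b)) if a is not None]
print(sum(averages) / len(averages))  # (0.9 + 0.8) / 2 = 0.85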
I have hundreds of subjects, so I need to know: is there any way I could pre-process the subject_avg values so that I don't need to individually calculate each subject's average in the given time window before calculating global_avg?
I have built a model with H2ORandomForestEstimator and the results look like the output below.
The threshold keeps changing (0.5 from training and 0.313725489027 from validation), and I would like to fix the threshold in H2ORandomForestEstimator for comparison during fine-tuning. Is there a way to set the threshold?
From http://h2o-release.s3.amazonaws.com/h2o/master/3484/docs-website/h2o-py/docs/modeling.html#h2orandomforestestimator, there is no such parameter.
If there is no way to set this, how do we know what threshold our model is built on?
rf_v1
** Reported on train data. **
MSE: 2.75013548238e-05
RMSE: 0.00524417341664
LogLoss: 0.000494320913199
Mean Per-Class Error: 0.0188802936476
AUC: 0.974221763605
Gini: 0.948443527211
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.5:

         0       1    Error    Rate
-----    ------  ---  -------  ---------------
0        161692  1    0        (1.0/161693.0)
1        3       50   0.0566   (3.0/53.0)
Total    161695  51   0        (4.0/161746.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.5 0.961538 19
max f2 0.25 0.955056 21
max f0point5 0.571429 0.983936 18
max accuracy 0.571429 0.999975 18
max precision 1 1 0
max recall 0 1 69
max specificity 1 1 0
max absolute_mcc 0.5 0.961704 19
max min_per_class_accuracy 0.25 0.962264 21
max mean_per_class_accuracy 0.25 0.98112 21
Gains/Lift Table: Avg response rate: 0.03 %
** Reported on validation data. **
MSE: 1.00535766226e-05
RMSE: 0.00317073755183
LogLoss: 4.53885183426e-05
Mean Per-Class Error: 0.0
AUC: 1.0
Gini: 1.0
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.313725489027:

         0      1    Error   Rate
-----    -----  ---  ------  ---------------
0        53715  0    0       (0.0/53715.0)
1        0      16   0       (0.0/16.0)
Total    53715  16   0       (0.0/53731.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- ------- -----
max f1 0.313725 1 5
max f2 0.313725 1 5
max f0point5 0.313725 1 5
max accuracy 0.313725 1 5
max precision 1 1 0
max recall 0.313725 1 5
max specificity 1 1 0
max absolute_mcc 0.313725 1 5
max min_per_class_accuracy 0.313725 1 5
max mean_per_class_accuracy 0.313725 1 5
The threshold used is the one that maximizes F1 (max-F1).
If you want to apply your own threshold, you will have to take the probability of the positive class and compare it yourself to produce the label you want.
If you use your web browser to connect to the H2O Flow Web UI inside of H2O-3, you can mouse over the ROC curve and visually browse the confusion matrix for each threshold, which is convenient.
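For the "compare it yourself" approach, a minimal Python sketch (assuming a binomial model trained on 0/1 labels, so the prediction frame has a p1 column with the positive-class probability; test is a placeholder for your validation frame):

# Apply a fixed threshold instead of the max-F1 threshold H2O reports.
my_threshold = 0.5

preds = rf_v1.predict(test)        # H2OFrame with columns: predict, p0, p1
df = preds.as_data_frame()
df["my_label"] = (df["p1"] > my_threshold).astype(int)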
The data format I have is as follows:
When I use s2 <- fill_(s1, c("Time")), it fills with the last seen value;
however, I would like all values of Time listed below to repeat for each value of Animal.
Group   Animal   Sex   Time
1       1001     M     0
                       4
                       8
                       24
                       48
1       1002     M
1       1003     M
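In other words, the desired result is a cross join of the Animal rows with the list of Time points, not a fill-down. A rough pandas sketch of that expansion (my own illustration, not the tidyverse answer):

import pandas as pd

animals = pd.DataFrame({"Group": [1, 1, 1],
                        "Animal": [1001, 1002, 1003],
                        "Sex": ["M", "M", "M"]})
times = pd.DataFrame({"Time": [0, 4, 8, 24, 48]})

# Every Time value repeated for every Animal: 3 x 5 = 15 rows.
expanded = animals.merge(times, how="cross")
print(expanded)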
Is there a way I can use ClickHouse (arrays?) to calculate sequential values that depend on previously calculated values?
For example:
On day 1, I start with 0, consume 5, and add 100, ending up with 0 - 5 + 100 = 95.
Day 2 starts with what I ended day 1 with, which is 95; again I consume 10 and add 5, ending up with 95 - 10 + 5 = 90 (which will be the start for day 3).
Given
ConsumeArray [5,10,25]
AddArray [100,5,10]
Calculate the EndingPosition for each day (which is also the StartingPosition for the next day):
                                                 Day1   Day2   Day3
StartingPosition (a) = previous EndingPosition      0     95     90   (to calculate)
Consumed (b)                                        5     10     25
Added (c)                                         100      5     10
EndingPosition (d) = a - b + c                     95     90     75   (to calculate)
Just finish all the add/consume operations first and then do an accumulation.
WITH [5,10,25] as ConsumeArray,
[100,5,10] as AddArray
SELECT
arrayCumSum(arrayMap((c, a) -> a - c, ConsumeArray, AddArray));
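The same map-then-accumulate idea in plain Python, just to sanity-check the logic with the example arrays:

from itertools import accumulate

consume = [5, 10, 25]
add = [100, 5, 10]

# Net change per day, then a running sum -- equivalent to
# arrayCumSum(arrayMap((c, a) -> a - c, ConsumeArray, AddArray)).
net = [a - c for c, a in zip(consume, add)]
print(list(accumulate(net)))  # [95, 90, 75]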
For example, suppose the items are chocolate, ice cream, donut, ..., ranked in order of preference.
If user 1 chooses
A B C D E F G H I J
and user 2 chooses
J A B C I G F E D H
what are some good ways to calculate a score from 0 to 100 that tells how close their choices are? It has to make sense: if most answers are the same and only 1 or 2 answers differ, the score should not be extremely low. And if most answers are merely shifted by one position, we cannot count them as "all different" and give a score of 0 for those one-position differences.
Assign each letter item an integer value starting at 1
A=1, B=2, C=3, D=4, E=5, F=6 (stopping at F for simplicity)
Then consider the position each item is placed in, and use this as a multiplier
So if an item is in the first position, its multiplier is 1; if it is the 6th item, the multiplier is 6
Figure out the maximum score you could have (basically when everything is in consecutive order)
item a b c d e f
order 1 2 3 4 5 6
value 1 2 3 4 5 6
score 1 4 9 16 25 36 Sum = 91, Score = 100% (MAX)
item a b d c e f
order 1 2 3 4 5 6
value 1 2 4 3 5 6
score 1 4 12 12 25 36 Sum = 90 Score = 99%
=======================
order 1 2 3 4 5 6
item f d b c e a
value 6 4 2 3 5 1
score 6 8 6 12 25 6 Sum = 63 Score = 69%
order 1 2 3 4 5 6
item d f b c e a
value 4 6 2 3 5 1
score 4 12 6 12 25 6 Sum = 65 Score = 71%
Obviously this is a very crude implementation that I just came up with, and it may not work for everything. Examples 3 and 4 are swapped by one position, yet the score is off by 2% (versus examples 1 and 2, which are off by 1%). It's just a thought; I'm no algorithm expert. You could probably take the final number and do something else to it for a better numerical comparison.
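A quick Python sketch of this crude scoring, using the six-item example above (the function name is made up):

def position_score(reference, ranking):
    """Sum of (position * item value), normalised by the best possible sum."""
    value = {item: i + 1 for i, item in enumerate(reference)}      # a=1, b=2, ...
    max_score = sum((i + 1) ** 2 for i in range(len(reference)))   # everything in order
    score = sum((pos + 1) * value[item] for pos, item in enumerate(ranking))
    return 100 * score / max_score

reference = list("abcdef")
print(position_score(reference, list("abdcef")))  # ~98.9, the "99%" example
print(position_score(reference, list("fdbcea")))  # ~69.2, the "69%" example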
You could
Calculate the edit distance between the sequences;
Subtract the edit distance from the sequence length;
Divide that by the length of the sequence
Multiply it by 100
Score = 100 * (SequenceLength - Levenshtein( Sequence1, Sequence2 ) ) / SequenceLength
Edit distance is basically the number of operations required to transform sequence one into sequence two. One such algorithm is the Levenshtein distance algorithm.
Examples:
Weights
insert: 1
delete: 1
substitute: 1
Seq 1: ABCDEFGHIJ
Seq 2: JABCIGFEDH
Score = 100 * (10-7) / 10 = 30
Seq 1: ABCDEFGHIJ
Seq 2: ABDCFGHIEJ
Score = 100 * (10-3) / 10 = 70
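A small Python sketch of this scoring (a standard dynamic-programming Levenshtein with unit costs, not tied to any particular library):

def levenshtein(a, b):
    """Edit distance with insert, delete and substitute all costing 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete
                            curr[j - 1] + 1,            # insert
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def similarity_score(seq1, seq2):
    return 100 * (len(seq1) - levenshtein(seq1, seq2)) / len(seq1)

print(similarity_score("ABCDEFGHIJ", "ABDCFGHIEJ"))  # 70.0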
The most straightforward way to calculate it is the Levenshtein distance, which is the number of changes needed to transform one string into another.
A disadvantage of Levenshtein distance for your task is that it doesn't measure closeness between the products themselves, i.e. you will not know how close A and J are to each other. For example, user 1 may like donuts and user 2 may like buns, and you may know that most people who like the first also like the second. From this information you can infer that user 1 makes choices that are close to the choices of user 2, even though they don't pick the same elements.
If this is your case, you will have to use one of two approaches: statistical methods to infer correlation between choices, or recommendation engines.