How to calculate average for each tumbling window? - apache-kafka-streams

I’m new to Kafka Streams and I’m really going crazy. I have a stream of counter values represented by <counter-name, counter-value, timestamp>. I want to calculate the average value for each day, like this:
counterValues topic content:
“cpu”, 10, “2022-06-03 17:00”
“cpu”, 20, “2022-06-03 18:00”
“cpu”, 30, “2022-06-04 10:00”
“memory”, 40, “2022-06-04 10:00”
and I want to obtain this output:
“cpu”, “2022-06-03”, 15
“cpu”, “2022-06-04”, 30
“memory”, “2022-06-04”, 40
This is a snippet of my code, which doesn’t work (it seems to calculate a count instead):
Duration windowSize = Duration.ofDays(1);
TimeWindows tumblingWindow = TimeWindows.of(windowSize);
counterValueStream
    .groupByKey()
    .windowedBy(tumblingWindow)
    .aggregate(StatisticValue::new, (k, counterValue, statisticValue) -> {
        statisticValue.setSamplesNumber(statisticValue.getSamplesNumber() + 1);
        statisticValue.setSum(statisticValue.getSum() + counterValue.getValue());
        return statisticValue;
    }, Materialized.with(Serdes.String(), statisticValueSerde))
    .toStream()
    .map((Windowed<String> key, StatisticValue sv) -> {
        double avgNoFormat = sv.getSum() / (double) sv.getSamplesNumber();
        double formattedAvg = Double.parseDouble(String.format("%.2f", avgNoFormat));
        return new KeyValue<>(key.key(), formattedAvg);
    })
    .to("average", Produced.with(Serdes.String(), Serdes.Double()));
But the aggregation result is:
“cpu”, 1, “2022-06-03 17:00”
“cpu”, 1, “2022-06-03 18:00”
“cpu”, 1, “2022-06-04 10:00”
“memory”, 1, “2022-06-04 10:00”
Note that I use a TimestampExtractor that uses the counter timestamp instead of the Kafka record timestamp. What am I doing wrong?
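For reference, the aggregation the question is after (a running sum and a sample count per counter and per day, divided at the end) can be sketched outside Kafka Streams. The Python below is only an illustration, assuming that the calendar date of each timestamp identifies its one-day tumbling window. `records` and `daily_averages` are hypothetical names, and the sketch does not reproduce Kafka Streams' epoch-aligned windows, its serdes, or the custom TimestampExtractor.

```python
from collections import defaultdict
from datetime import datetime

# Sample records from the question: (counter name, value, timestamp).
records = [
    ("cpu", 10, "2022-06-03 17:00"),
    ("cpu", 20, "2022-06-03 18:00"),
    ("cpu", 30, "2022-06-04 10:00"),
    ("memory", 40, "2022-06-04 10:00"),
]


def daily_averages(rows):
    """Average the values per (counter, day); each day acts as one tumbling window."""
    # (name, day) -> [sum, sample count], the same two fields StatisticValue tracks.
    acc = defaultdict(lambda: [0.0, 0])
    for name, value, ts in rows:
        # Assumption: the calendar date of the timestamp is the window key.
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M").date().isoformat()
        acc[(name, day)][0] += value
        acc[(name, day)][1] += 1
    return {key: total / count for key, (total, count) in acc.items()}


averages = daily_averages(records)
for (name, day), avg in sorted(averages.items()):
    print(name, day, avg)
```

With the sample input this yields 15 for cpu on 2022-06-03, 30 for cpu on 2022-06-04 and 40 for memory on 2022-06-04, which matches the output the question asks for.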

Related

How do I make my Shiny app display its output?

The primary variable is AgeGroup, which has 2 levels. I am trying to get the sample size to output, but for some reason the app either gives an error or won't output anything. Can anyone help? There are some comments in the code to help with confusion.
Code:
library(shiny)
library(shinyWidgets)
library(survival)
library(shinyjs)
library(survminer)
# Define UI for application that draws a histogram
ui <- fluidPage(
# Application title
titlePanel("ProMBA Haslam Ad Sample Size"),
#Put in all key inputs as numeric inputs that the user will type in and choose starting, default values for these inputs
tabPanel("Inputs",
div( id ="form",
column(4,
numericInput("power", label=h6("Power"), value = .9),
numericInput("alpha", label=h6("Alpha"), value = .05),
numericInput("precision", label=h6("Precision"), value =0.05),
numericInput("Delta", label=h6("Delta"), value=.3),
column(4,
numericInput("sample", label=h6("Starting Sample Size"), value = 40),
numericInput("reps", label=h6("Number of Replications"), value=1000)),
),
column(4,
#title of output
h4("Calculated Sample Size"),
verbatimTextOutput(("n"),placeholder=TRUE)),
#create action buttons for users to run the form and to reset the form
textOutput("Sample Size (n)"),
column(4,
actionButton("action","Calculate"))
)))
server = function(input,output,session){
buttonGo = eventReactive(input$action, {withProgress(message = "Running", {
#relist the key inputs and save them to be able to be used in the rest of the code
n<-input$sample/2
alpha<-input$alpha
power <- input$power
beta<-1-input$power
precision<-input$precision
delta <- input$Delta +1
rep <- input$reps
nincrease<-10
#manually load in the data from the baseline data .xlsx file
Reporting <- c("12/13/21","12/14/21","12/15/21","12/16/21","12/17/21","12/18/21","12/19/21","12/20/21","12/21/21","12/22/21","12/23/21","12/24/21","12/25/21","12/26/21","12/27/21","12/28/21","12/29/21","12/30/21","12/31/21","1/1/22")
AdSet <- "Status Quo"
Results <- c(70,52, 33, 84, 37, 41, 22, 53, 78, 66, 100, 110, 52, 43, 63, 84, 16, 64, 21, 69)
ResultIndicator <- "actions:link_click"
Budget <- 100
CostPerClick<- c(1.43, 1.92, 3.03, 1.19, 2.70, 2.44, 4.55, 1.89, 1.28, 1.52, 1.00, 0.91, 1.92, 2.33, 1.59, 1.19, 6.25, 1.56, 4.76, 1.45)
Impressions <- c(7020, 8430, 5850, 7920, 6890, 7150, 6150, 7370, 8440, 6590, 6750,8720, 6410,7720, 6940, 8010, 7520, 7190, 6540, 6020)
df <- data.frame(Reporting, AdSet, Results, ResultIndicator,Budget,CostPerClick,Impressions)
#define the standard deviation of the results as well as the mean for group 1 of the 2 level variable and the mean for group 2
mean1 = mean(df$Results)
sd1 = sd(df$Results)
mean2 = delta*mean1
click=rep(0,n)
#Create 2 level variable
AgeGroup <- rep(c("Age21-35","Age36-50"),each=n)
#create new data frame with 2 level variable and click repetitions
DataFrame2 <- data.frame(AgeGroup,click)
#create new data frame binding all of the input variables together
DataFrame3 <- data.frame(cbind(n,alpha,power,precision,delta,rep))
#create for loop to find the pvalue of the ttest run with click~AgeGroup
trials=function(){
for(i in 1:nrow(DataFrame2)){
if(any(DataFrame2$AgeGroup[i]=="Age21-35")){DataFrame2$click[i] =rnorm(1,mean1,sd1)}else{DataFrame2$click[i] =rnorm(1,mean2,sd1)}
}
pvalttest=t.test(click~AgeGroup, data=DataFrame2)
return(pvalttest$p.value)
}
p_values=replicate(200,trials())
p_values=replicate(input$rep,trials())
#find if the p value is significance
significance=p_values[p_values<alpha]
#find the power of the signifiance and the pvalue
power <- length(significance)/length(p_values)
print(c(power,n))
#run a while loop to find the n within the goal power limits
goalpower<-1-beta
lowergoal<-goalpower-input$precision
uppergoal<-goalpower+input$precision
while (power<lowergoal||power>uppergoal){
if (power<lowergoal){
n=n+nincrease
AgeGroup=c()
click=c()
AgeGroup=rep(c("Age21-35","Age36-50"), each=n)
click=rep(NA,2*n)
Dataframe2=data.frame(AgeGroup,click)
p_values=replicate(input$reps, trials())
significance=p_values[p_values<alpha]
power=length(significance)/length(p_values)
print(c(n, power))
}else{
nincrease=nincrease%/%(10/9) #%/% fixes issue of rounding
n=n-nincrease
AgeGroup=c()
click=c()
AgeGroup=rep(c("Age21-35","Age36-50"), each=n)
click=rep(NA,2*n)
DataFrame2=data.frame(AgeGroup,click)
p_values=replicate(input$reps, trials())
significance=p_values[p_values<alpha]
power=length(significance)/length(p_values)
print(c(n, power))
}
}
#n is defined as the sample size of one of the levels of the 2 level variable, so multiply by 2 to get full sample size
n*2
})
}
shinyApp(ui, server)
I don't need the app to be pretty. I just want it to run whenever someone clicks the Calculate button.

Technical Analysis (MACD) for crypto trading

Background:
I am writing a crypto trading bot for fun and profit.
So far, it connects to an exchange and gets streaming price data.
I am using this price to create a technical indicator (MACD).
Generally for MACD, it is recommended to use closing prices for 26, 12 and 9 days.
However, for my trading strategy, I plan to use data for 26, 12 and 9 minutes.
Question:
I am getting multiple (say 10) price ticks in a minute.
Do I simply average them and round the time to the next minute (so they all fall in the same minute bucket)? Or is there a better way to handle this?
Many Thanks!
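For context, MACD itself only needs one closing price per bar, whatever the bar size. The sketch below is a minimal Python illustration of the usual 12/26/9 calculation applied to a list of per-minute closes. It assumes each EMA is seeded with the first value (some implementations seed with a simple moving average instead, which changes the early values), and the function names `ema` and `macd` are made up for this example.

```python
def ema(values, period):
    """Exponential moving average, seeded with the first value (an assumption)."""
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line, histogram) for a list of bar closes."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    # MACD line: fast EMA minus slow EMA, bar by bar.
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    # Signal line: EMA of the MACD line.
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Hypothetical one-minute closes: a steadily rising price.
closes = [100.0 + i for i in range(60)]
macd_line, signal_line, histogram = macd(closes)
print(macd_line[-1], signal_line[-1], histogram[-1])
```

The periods are counted in bars, so feeding one close per minute gives the 26/12/9 minute variant described above.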
This is how I handled it. Streaming data arrives at intervals of less than 1 s. The code checks for a new low and high during the streaming period and builds the candle. Probably ugly since I'm not a trained developer, but it works.
Adjust "...round('20s')" and "if dur > 15:" for whatever candle period you want.
# tkr, ctime and cnx are not defined in this excerpt; they are assumed to exist elsewhere in the bot.
def on_message(self, msg):
    df = pd.json_normalize(msg, record_prefix=msg['type'])
    df['date'] = df['time']
    df['price'] = df['price'].astype(float)
    df['low'] = df['low'].astype(float)
    for i in range(0, len(self.df)):
        if i == (len(self.df) - 1):
            self.rounded_time = self.df['date'][i]
            self.rounded_time = pd.to_datetime(self.rounded_time).round('20s')
            self.lhigh = self.df['price'][i]
            self.lhighcandle = self.candle['high'][i]
            self.llow = self.df['price'][i]
            self.lowcandle = self.candle['low'][i]
            self.close = self.df['price'][i]
            if self.lhigh > self.lhighcandle:
                nhigh = self.lhigh
            else:
                nhigh = self.lhighcandle
            if self.llow < self.lowcandle:
                nlow = self.llow
            else:
                nlow = self.lowcandle
            newdata = pd.DataFrame.from_dict({
                'date': self.df['date'],
                'tkr': tkr,
                'open': self.df.price.iloc[0],
                'high': nhigh,
                'low': nlow,
                'close': self.close,
                'vol': self.df['last_size']})
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
            self.candle = pd.concat([self.candle, newdata], ignore_index=True).fillna(0)
            if ctime > self.rounded_time:
                closeit = True
                self.en = time.time()
            if closeit:
                dur = (self.en - self.st)
                if dur > 15:
                    self.st = time.time()
                    out = self.candle[-1:]
                    out.to_sql(tkr, cnx, if_exists='append')
                    dat = ['tkr', 0, 0, 100000, 0, 0]
                    self.candle = pd.DataFrame([dat], columns=['tkr', 'open', 'high', 'low', 'close', 'vol'])
As far as I know, most or all technical indicator formulas rely on same-sized bars to produce accurate and meaningful results. You'll have to do some data transformation. Here's an example of an aggregation technique that uses quantization to get all your bars into uniform sizes. It will convert small bar sizes to larger bar sizes; e.g. second to minute bars.
// C#, see link above for more info
quoteHistory
    .OrderBy(x => x.Date)
    .GroupBy(x => x.Date.RoundDown(newPeriod))
    .Select(x => new Quote
    {
        Date = x.Key,
        Open = x.First().Open,
        High = x.Max(t => t.High),
        Low = x.Min(t => t.Low),
        Close = x.Last().Close,
        Volume = x.Sum(t => t.Volume)
    });
See Stock.Indicators for .NET for indicators and related tools.
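The same round-down-and-group idea can be sketched in Python for raw ticks rather than existing bars. This is only an illustration: the tick tuple layout (timestamp, price, size), the sample values, and the names `round_down` and `ticks_to_bars` are assumptions, not part of any exchange API or of the library mentioned above. Each bar is labelled with the start of its period, so a tick at 12:00:30 lands in the 12:00 bar.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def round_down(ts, period):
    """Floor a timezone-aware datetime to a multiple of `period` since the Unix epoch."""
    return ts - ((ts - EPOCH) % period)


def ticks_to_bars(ticks, period=timedelta(minutes=1)):
    """Group (timestamp, price, size) ticks into uniform OHLCV bars."""
    bars = {}
    for ts, price, size in sorted(ticks, key=lambda t: t[0]):
        key = round_down(ts, period)
        bar = bars.get(key)
        if bar is None:
            # First tick of the period opens the bar.
            bars[key] = {"date": key, "open": price, "high": price,
                         "low": price, "close": price, "volume": size}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # ticks are sorted, so the last one seen closes the bar
            bar["volume"] += size
    return [bars[k] for k in sorted(bars)]


def at(h, m, s):
    return datetime(2022, 1, 1, h, m, s, tzinfo=timezone.utc)


# Hypothetical ticks: three in the 12:00 minute, one in the 12:01 minute.
ticks = [
    (at(12, 0, 5), 100.0, 1.0),
    (at(12, 0, 30), 101.5, 2.0),
    (at(12, 0, 55), 99.0, 1.0),
    (at(12, 1, 10), 100.5, 3.0),
]
bars = ticks_to_bars(ticks)
for bar in bars:
    print(bar)
```

Using the last price of each bucket as the close, rather than the average of the ticks, keeps the bars comparable with ordinary candlestick data.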

How to calculate the number of times an event occurs in any month if I have the weekly schedule?

I am currently writing a Flutter app which includes displaying the weekly schedule of a class. I also have to calculate the attendance of each student. To do that, I need the number of times each subject is taught in any given month, and I am stumped as I can't think of a way to do that.
I have the weekly schedule of the class and I stored it in Firestore. It is formatted as below,
{Day: 1 , Major: 'EcE', Subject: 'Communication', Year: 5, Period: 1, ...}
where Day refers to Monday, Tuesday, ...
It appears in the app like this.
My problem is that, to track and calculate the attendance of students, I have to know how many times each subject is taught in a month, but I don't think multiplying the weekly count by 4 is viable since the number of days in a month varies. I don't know how to work with a calendar programmatically, so I am currently out of ideas. Any hints and suggestions are appreciated. The suggestions can be in either Dart or Node.js since I can implement this either on the client side or in Cloud Functions. Thanks a lot.
P.S. I haven't provided any code for now, but please ask me for clarifications and I will provide the related code. I just didn't want to bloat the post.
If I understood your question correctly, all you need is to count the number of occurrences of each weekday in a given month.
Here you go, in both JS and Dart:
JS:
var dtNow = new Date(); // replace this for the month/date of your choosing
var dtFirst = new Date(dtNow.getFullYear(), dtNow.getMonth(), 1); // first date in a month
var dtLast = new Date(dtNow.getFullYear(), dtNow.getMonth() + 1, 0); // last date in a month
// we need to keep track of weekday occurrence in a month, a map looks suitable for this
var dayOccurrence = {
    "Monday" : 0,
    "Tuesday" : 0,
    "Wednesday" : 0,
    "Thursday" : 0,
    "Friday" : 0,
    "Saturday" : 0,
    "Sunday" : 0
}
var dtLoop = new Date(dtFirst); // temporary date, used for looping
while(dtLoop <= dtLast){
    // getDay() method returns the day of the week, from 0(Sunday) to 6(Saturday)
    switch(dtLoop.getDay()) {
        case 0:
            dayOccurrence.Sunday++; break;
        case 1:
            dayOccurrence.Monday++; break;
        case 2:
            dayOccurrence.Tuesday++; break;
        case 3:
            dayOccurrence.Wednesday++; break;
        case 4:
            dayOccurrence.Thursday++; break;
        case 5:
            dayOccurrence.Friday++; break;
        case 6:
            dayOccurrence.Saturday++; break;
        default:
            console.log("this should not happen");
    }
    dtLoop.setDate(dtLoop.getDate() + 1);
}
// log the results
var keys = Object.keys(dayOccurrence);
keys.forEach(key=>{
    console.log(key + ' : ' + dayOccurrence[key]);
});
And here is the same thing in Dart:
void main() {
    DateTime dtNow = new DateTime.now(); // replace this for the month/date of your choosing
    DateTime dtFirst = new DateTime(dtNow.year, dtNow.month, 1); // first date in a month
    DateTime dtLast = new DateTime(dtNow.year, dtNow.month + 1, 0); // last date in a month
    // we need to keep track of weekday occurrence in a month, a map looks suitable for this
    Map<String, int> dayOccurrence = {
        'Monday' : 0,
        'Tuesday' : 0,
        'Wednesday' : 0,
        'Thursday' : 0,
        'Friday' : 0,
        'Saturday' : 0,
        'Sunday' : 0
    };
    DateTime dtLoop = DateTime(dtFirst.year, dtFirst.month, dtFirst.day);
    while(DateTime(dtLoop.year, dtLoop.month, dtLoop.day) != dtLast.add(new Duration(days: 1))){
        // weekday is the day of the week, from 1(Monday) to 7(Sunday)
        switch(dtLoop.weekday) {
            case 1:
                dayOccurrence['Monday']++; break;
            case 2:
                dayOccurrence['Tuesday']++; break;
            case 3:
                dayOccurrence['Wednesday']++; break;
            case 4:
                dayOccurrence['Thursday']++; break;
            case 5:
                dayOccurrence['Friday']++; break;
            case 6:
                dayOccurrence['Saturday']++; break;
            case 7:
                dayOccurrence['Sunday']++; break;
            default:
                print("this should not happen");
        }
        dtLoop = dtLoop.add(new Duration(days: 1));
    }
    // log the results
    dayOccurrence.forEach((k,v) => print('$k : $v'));
}
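To connect the weekday counts back to the attendance problem, here is a rough Python sketch of the last step: multiplying each timetable entry by the number of times its weekday falls in the month. It assumes that Day uses 1 for Monday through 7 for Sunday, as in Dart's `weekday`, and the timetable entries and function names are invented for illustration. Holidays and cancelled classes are not accounted for.

```python
import calendar
from collections import Counter
from datetime import date


def weekday_counts(year, month):
    """Count occurrences of each ISO weekday (1 = Monday .. 7 = Sunday) in a month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return Counter(
        date(year, month, day).isoweekday()
        for day in range(1, days_in_month + 1)
    )


def classes_in_month(schedule, year, month):
    """Sum, per subject, how often its weekly slots occur in the given month."""
    counts = weekday_counts(year, month)
    totals = Counter()
    for entry in schedule:
        totals[entry["Subject"]] += counts[entry["Day"]]
    return totals


# Hypothetical weekly timetable in the same shape as the Firestore documents.
schedule = [
    {"Day": 1, "Subject": "Communication", "Period": 1},
    {"Day": 3, "Subject": "Communication", "Period": 2},
    {"Day": 4, "Subject": "Control", "Period": 1},
]

june = weekday_counts(2022, 6)
totals = classes_in_month(schedule, 2022, 6)
print(dict(june))
print(dict(totals))
```

June 2022 starts on a Wednesday and has 30 days, so Wednesdays and Thursdays occur five times and every other weekday four times; a subject taught on Mondays and Wednesdays therefore meets nine times that month.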

Sklearn RandomizedSearchCV suddenly stuck

I tried to run a RandomizedSearchCV for an SVM model, but it seems to take forever. It works fine for KNN. The process appears to get stuck after finishing some of the tasks. Below is my code:
# SVC Parameter Tuning
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

svc_params = {'C': np.power(10, np.arange(-5., 1.)),
              'kernel': ['rbf', 'linear', 'sigmoid', 'poly'],
              'degree': np.arange(3, 21),
              'coef0' : np.linspace(0, 1, 100),
              'shrinking': [True, False],
              'class_weight' : ['balanced', None]}
svc = RandomizedSearchCV(SVC(),
                         svc_params,
                         cv = 5,
                         scoring = 'roc_auc',
                         n_jobs = 128,
                         n_iter = 100,
                         verbose = 2)
After some results, the process got stuck:
[CV] kernel=poly, C=0.0001, degree=20, coef0=0.848484848485,
shrinking=True, class_weight=balanced, total= 11.1s
[CV] kernel=poly, C=0.0001, degree=20, coef0=0.848484848485,
shrinking=True, class_weight=balanced, total= 11.0s
[CV] kernel=poly, C=0.0001, degree=20, coef0=0.848484848485,
shrinking=True, class_weight=balanced, total= 11.5s
I am really out of ideas. Thanks for your help!

Getting max and min from two different sets in json

I haven't found a solution with data set up quite like mine...
var marketshare = [
{"store": "store1", "share": "5.3%", "q1count": 2, "q2count": 4, "q3count": 0},
{"store": "store2","share": "1.9%", "q1count": 5, "q2count": 10, "q3count": 0},
{"store": "store3", "share": "2.5%", "q1count": 3, "q2count": 6, "q3count": 0}
];
Code so far, returning undefined...
var minDataPoint = d3.min( d3.values(marketshare.q1count) ); //Expecting 2 from store 1
var maxDataPoint = d3.max( d3.values(marketshare.q2count) ); //Expecting 10 from store 2
I'm a little overwhelmed by d3.keys, d3.values, d3.maps, converting to array, etc. Any explanations or nudges would be appreciated.
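I think you're looking for something like this instead:
d3.min(marketshare, function(d){ return d.q1count; }) // => 2
d3.max(marketshare, function(d){ return d.q2count; }) // => 10
marketshare is an array of objects, so marketshare.q1count is undefined, which is why the original calls return undefined. You can pass an accessor function as the second argument to d3.min/d3.max; it is called for each element, and the minimum or maximum of the returned values is used.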
