Calculate time remaining with different length of variables - time

I will have to admit the title of this question sucks... I couldn't get the best description out. Let me see if I can give an example.
About 2700 customers have had my software installed on their servers at one time; 1500 or so still do. Basically, I have an auto-diagnostics process to help weed out people who have uninstalled the software, or who have problems with it that we can assist with. Currently we use cURL to fetch our software from their website and look at the header returned.
We have 8 different statuses that are returned:
GREEN - Everything works (usually pretty quick 0.5 - 2 seconds)
RED - Software not found (usually the longest from 5 - 15 seconds)
BLUE - Software found but not activated (usually from 3 - 9 seconds)
YELLOW - Server IP mismatch (usually from 1 - 3 seconds)
ORANGE - Server IP mismatch and wrong software type (usually 5 - 10 seconds)
PURPLE - Activation key incorrect (usually within 2 seconds)
BLACK - Domain returns 404 - No longer exists (usually within a second)
UNK - Connection failed (usually due to our load balancer -- VERY rare) (never encountered this yet)
Basically, a cron job starts the process by pulling the domain and product type. It then cURLs the domain and works out which of the status colors above applies.
While this is happening, an AJAX page returns the results so we can keep an eye on the status. The major problem is that the Time Remaining is so volatile that it does not give a good estimate. Here is the current math:
# Number of accounts between NOW and when started
$completedAccounts = floor($parseData[2]*($parseData[1]/100));
# Number of seconds between NOW and when started
$completedTime = strtotime("now") - strtotime("$hour:$minute:$second");
# Avg number of seconds per account
$avgPerCompleted = $completedTime / $completedAccounts;
# Total number of remaining accounts to be scanned
$remainingAccounts = $parseData[2] - $completedAccounts;
# The total of seconds remaining for all of the remaining accounts
$remainingSeconds = $remainingAccounts * $avgPerCompleted;
$remainingTime = format_time($remainingSeconds, ":");
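For readers who do not use PHP, the same arithmetic can be sketched in Python. This is only an illustration: the function name and parameters are hypothetical, and it assumes `$parseData[2]` is the total number of accounts and `$parseData[1]` the percentage completed, which is what the formula above implies. It also adds a guard for the case where no account has completed yet, which the PHP above would turn into a division by zero.

```python
import math


def estimate_remaining_seconds(total_accounts, percent_done, elapsed_seconds):
    """Naive remaining-time estimate: average seconds per completed account
    multiplied by the number of accounts still to scan.

    All three parameters are hypothetical stand-ins for the values the PHP
    snippet reads from $parseData and the start timestamp.
    """
    completed = math.floor(total_accounts * (percent_done / 100))
    if completed == 0:
        # Nothing finished yet, so there is no average to extrapolate from.
        return None
    avg_per_completed = elapsed_seconds / completed
    remaining = total_accounts - completed
    return remaining * avg_per_completed


# Example: 2700 accounts, 10% done after 540 seconds.
print(estimate_remaining_seconds(2700, 10, 540))
```

Because the average is taken over everything completed so far, one run of slow RED accounts early in the scan inflates the estimate for the whole remainder, which is the volatility described above.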
I could keep a count of all the greens, reds, blues, etc., average how long each color takes, and then use that for the average time, although I don't believe that would give much better results.
With times this varied, any suggestions would be appreciated.
Thanks,
Jeff

OK, I believe I have figured it out. I had to create a class so I could calculate a simple linear regression over a period of time.
function calc() {
    $n = count($this->mDatas);
    $vSumXX = $vSumXY = $vSumX = $vSumY = 0;
    //var_dump($this->mDatas);
    $vCnt = 0; // for time-series, start at t=0
    foreach ($this->mDatas AS $vOne) {
        if (is_array($vOne)) { // x,y pair
            list($x,$y) = $vOne;
        } else { // time-series
            $x = $vCnt; $y = $vOne;
        } // fi
        $vSumXY += $x*$y;
        $vSumXX += $x*$x;
        $vSumX += $x;
        $vSumY += $y;
        $vCnt++;
    } // rof
    $vTop = ($n*$vSumXY - $vSumX*$vSumY);
    $vBottom = ($n*$vSumXX - $vSumX*$vSumX);
    $a = $vBottom!=0?$vTop/$vBottom:0;
    $b = ($vSumY - $a*$vSumX)/$n;
    //var_dump($a,$b);
    return array($a,$b);
}
I take each account and build an array of the amount of time each one takes. The array then runs through this calculation, which treats it as a set of x (index) and y (time) values. Finally, I run the array through the predict function.
/** given x, return the prediction y */
function calcpredict($x) {
list($a,$b) = $this->calc();
$y = $a*$x+$b;
return $y;
}
I put static values in so you could see the results:
$eachTime = array(7,1,.5,12,11,6,3,.24,.12,.28,2,1,14,8,4,1,.15,1,12,3,8,4,5,8,.3,.2,.4,.6,4,5);
$forecastProcess = new Linear($eachTime);
$forecastTime = $forecastProcess->calcpredict(5);
This overall system gives me about a .003 difference in 10 accounts and about 2.6 difference in 2700 accounts. Next will be to calculate the Accuracy.
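As a cross-check, the least-squares fit in calc() and the prediction in calcpredict() can be sketched in Python. The function names here are hypothetical and this is not the Linear class itself; it only mirrors the time-series branch, where x is the position of each timing in the array.

```python
def fit_line(ys):
    """Ordinary least-squares line through the points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept), mirroring the $a and $b returned by calc().
    """
    n = len(ys)
    if n == 0:
        raise ValueError("need at least one observation")
    xs = range(n)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    top = n * sum_xy - sum_x * sum_y
    bottom = n * sum_xx - sum_x * sum_x
    slope = top / bottom if bottom != 0 else 0.0
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


def predict(ys, x):
    """Given x, return the fitted y, as calcpredict() does."""
    slope, intercept = fit_line(ys)
    return slope * x + intercept


# The static timings from the post above.
each_time = [7, 1, .5, 12, 11, 6, 3, .24, .12, .28, 2, 1, 14, 8, 4, 1, .15,
             1, 12, 3, 8, 4, 5, 8, .3, .2, .4, .6, 4, 5]
print(predict(each_time, 5))
```

On perfectly linear input such as `[1, 3, 5, 7]` the fit recovers slope 2 and intercept 1, which is a quick way to confirm the sums are wired up correctly.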
Thanks for trying guys and gals


Matlab ROS slow publishing + subscribing

In my experience, MATLAB performs publish/subscribe operations with ROS slowly for some reason. I work with components defined as objects of a class, as in the test class shown below. Normally, objects of comparable structure are used to control mobile robots.
To quantify performance, I measured the time required per operation and got the following results:
1x publishing a message + 1x simple subscriber callback : 3.7ms
Simply counting in a callback (per count): 2.1318e-03 ms
Creating a new message with msg1 = rosmessage(obj.publisher) adds 3.6-4.3ms per iteration
Pinging myself indicated communication latency of 0.05 ms
The time required for a simple publish plus the start of a subscriber callback seems oddly long.
I want to have multiple system components as objects in my workspace so that they respond to ROS topic updates or timer events. The PC used for testing is not a monster, but it should not be garbage either.
Do you also think the time requirements shown are unnecessarily large? They barely allow publishing a single topic at 200 Hz without doing anything else. Normally I have multiple lower-frequency topics (e.g. 20 Hz), but the total time consumed becomes significant.
Do you know any practices to make the system operate more quickly?
What do you think of the OOP style of building control system components in general?
classdef subpubspeedMonitor < handle
% Use: call in matlab console, after initializing ros:
%
% SPM1 = subpubspeedMonitor()
%
% This will create an object which starts a set repetitive task upon creation
% and finally destructs itself after posting results in console.
properties
node
subscriber
publisher
timestart
messagetotal
end
methods
function obj = subpubspeedMonitor()
obj.node = ros.Node('subspeedmonitor1');
obj.subscriber = ros.Subscriber(obj.node,'topic1','sensor_msgs/NavSatFix',{@obj.rosSubCallback});
obj.publisher = ros.Publisher(obj.node,'topic1','sensor_msgs/NavSatFix');
obj.timestart = tic;
obj.messagetotal = 0;
msg1 = rosmessage(obj.publisher);
% Choose to evaluate subscriber + publisher loop or just counting
if 1
send(obj.publisher,msg1);
else
countAndDisplay(obj)
end
end
%% Test method one: repetitive publishing and subscribing
function rosSubCallback(obj,~,msg_) % ~3.7 ms per loop for a simple publish+subscribe action
% Latency to self is 0.05ms on average, according to "pinging" in terminal
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); % this line adds 4.3000ms per loop
msg_.Longitude = 51; % this line adds 0.25000 ms per loop
send(obj.publisher,msg_)
else
% Display some results
timepassed = toc(obj.timestart);
time_per_pubsub = timepassed/obj.messagetotal
delete(obj);
end
end
%% Test method two: simply counting
function countAndDisplay(obj) % this costs 2.1318e-03 ms(!) per loop
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); %adds 3.6ms per loop
%i = 1% adds 5.7532e-03 ms per loop
%msg1 = rosmessage("std_msgs/Bool"); %adds 1.5ms per loop
countAndDisplay(obj);
else
% Display some results
timepassed = toc(obj.timestart);
time_per_count_FCN = timepassed/obj.messagetotal
delete(obj);
end
end
%% Deconstructor
function delete(obj)
delete(obj.subscriber)
delete(obj.publisher)
delete(obj.node)
end
end
end

auto_arima() m value, and seasonal decomposition period parameter

I am working on ARIMA modeling. The data has hourly granularity, taken from 1st May 2022 to 8th June 2022. I am trying to forecast the next 30 days, i.e. 720 hours. I am running into trouble and am confused about the points below. Any pointers would be great.
Tried plotting the raw data and found no trend or seasonality.
a) Checked seasonal_decomposition() with a few period values, starting with period=1 (this matches my understanding that the seasonal component should be 0)
b) period = 12 (just a random choice, but why does it show some seasonality? Even if I plot without a period, for which the default value is 7, it still shows seasonality. Why?)
Plotted this graph with seasonal=False, since I do not see any seasonality or trend in the raw plot, and got the plot below. What should be concluded from it?
Then I thought of capturing the seasonality by resampling to daily data and plotting that, which confused me further.
a) period = 7 (default for seasonal_decomposition): again I can see a seasonality of 4 days when the raw plot does not show any.
The forecast for this resampled (daily) data is below.
I am clueless now as to what to look at. The more I read, the more confused I get.
Below is the code that I am using.
import numpy as np
import pandas as pd
import pmdarima as pm
import matplotlib.pyplot as plt
df=pd.read_csv('~/Desktop/gru-scl/gru-scl-filtered.csv', index_col="time")
del df["Index"]
df.index=pd.to_datetime(df.index)
model = pm.auto_arima(df.bps, start_p=0, start_q=0,
test='adf', # use adftest to find optimal 'd'
max_p=3, max_q=3, # maximum p and q
m=24, # frequency of series
d=None, # let model determine 'd'
seasonal=False, # No Seasonality
start_P=0,
D=0,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
f_steps=720
fc, confint = model.predict(n_periods=f_steps, return_conf_int=True)
fc_index = np.arange(len(df.bps), len(df.bps)+f_steps)
val=0
for f in fc:
    val = val+f
mean = val/f_steps
print(mean)
# make series for plotting purpose
fc_series = pd.Series(fc, index=fc_index)
lower_series = pd.Series(confint[:, 0], index=fc_index)
upper_series = pd.Series(confint[:, 1], index=fc_index)
# Plot
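# plot the actuals against hour offsets so they share an x-axis with fc_index
plt.plot(np.arange(len(df.bps)), df.bps.to_numpy(), label="Actual values")
plt.plot(fc_series, color='darkgreen', label="Predicted values")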
plt.fill_between(fc_index,
lower_series,
upper_series,
color='k', alpha=.15)
plt.legend(loc='upper left', fontsize=8)
plt.title('Forecast vs Actuals')
plt.xlabel("Hours since 1st May 2022")
plt.ylabel("Bps")
plt.show()
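On the question of why a seasonal component shows up for any period greater than 1: a classical decomposition, of the kind statsmodels' seasonal_decompose performs as far as I understand it, estimates the seasonal part by averaging the values at each position within the period. On a finite sample those averages are almost never exactly equal, even for pure noise. The sketch below illustrates that idea with the standard library only. It is a simplification (no trend removal), the helper name is hypothetical, and the data is synthetic noise rather than the bps series.

```python
import random
from statistics import mean


def naive_seasonal_component(values, period):
    """Average of the values at each position in the period, minus the overall mean.

    A simplified stand-in for the seasonal step of a classical decomposition;
    it skips the detrending a real implementation would do first.
    """
    overall = mean(values)
    return [mean(values[k::period]) - overall for k in range(period)]


random.seed(0)
# Synthetic white noise with no seasonality by construction (hypothetical length).
noise = [random.gauss(0, 1) for _ in range(24 * 39)]

for period in (1, 7, 12, 24):
    component = naive_seasonal_component(noise, period)
    print(period, max(abs(c) for c in component))
```

With period=1 every value falls in the same bucket, so the component is exactly zero, which matches the observation in (a). For any larger period the bucket averages differ slightly, so a repeating pattern is drawn even though the data has none. The thing to look at is the size of that component relative to the residual, not whether it is present.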

PySpark performance tuning - to cache or not to cache?

I am trying to speed up the calculations from multiple operations that I add as columns to a PySpark data frame, and I came across the sparkbyexamples article on performance tuning. I am considering how to use cache and spark.sql.shuffle.partitions.
Would cache be appropriate for code that first joins multiple data frames and then adds calculations over different windows?
What happens when the cached data frame is reassigned (see below)?
Example:
from pyspark.sql import Window
from pyspark.sql import functions as F
df = dfA.join(dfB, on = ['key'], how ='left') # should I add .cache here?
w_u = Window.partitionBy('user')
w_m = Window.partitionBy(['user','month']).orderBy('month')\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
MLAB = ['val1','val2'] # example to indicate that I run similar operations multiple times
for mlab in MLAB:
    percent_50 = F.expr('percentile_approx('+mlab+',0.5)')
    df = df.withColumn(mlab+'_md', percent_50.over(w_u)) # what happens with the cache when I reassign it?
Afterwards I am adding additional operations that include aggregations, such as:
radius_df = (df
# number of visits per stop
.groupby('userId', 'locationId').agg(F.count(F.lit(1)).alias('n_i'),
F.first('locationLongitude').alias('locationLongitude'),
F.first('locationLatitude').alias('locationLatitude'))
#compute center of mass (lat/lon) per user
.withColumn('center_lon', F.avg(F.col('locationLongitude')).over(w))
.withColumn('center_lat', F.avg(F.col('locationLatitude')).over(w))
# compute total visits
.withColumn('N', F.sum(F.col('n_i')).over(w))
# compute (r_i - r_cm)
.withColumn('distance', distance(F.col('locationLatitude'), F.col('locationLongitude'), F.col('center_lat'), F.col('center_lon')))
# compute n_i(r_i - r_cm)^2 / N
.withColumn('distance2', F.col('n_i') * (F.col('distance') * F.col('distance')) / F.col('N'))
# compute sum(n_i(r_i - r_cm)^2)
.groupBy('userId').agg(F.sum(F.col('distance2')).alias('sum_dist2'))
# square root
.withColumn('radius_gyr', F.sqrt(F.col('sum_dist2')))
.select('userId','radius_gyr')
)
df_f = df.join(radius_df.dropDuplicates(), on='userId', how='left')
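To make the arithmetic in the comments above concrete, here is a plain-Python sketch of the same radius-of-gyration steps for a single user. Several things are assumptions for illustration: the function name is hypothetical, the undefined `distance` helper is replaced by planar Euclidean distance, and the window `w` is taken to partition by user, so the center is the unweighted average over distinct locations, as the `F.avg` calls suggest.

```python
import math
from collections import Counter


def radius_of_gyration(visits):
    """visits: list of (lat, lon) tuples, one entry per visit, for one user."""
    if not visits:
        raise ValueError("need at least one visit")
    counts = Counter(visits)                 # n_i: visits per distinct location
    total = sum(counts.values())             # N: total visits
    # Center of mass as the plain average over distinct locations.
    center_lat = sum(lat for lat, _ in counts) / len(counts)
    center_lon = sum(lon for _, lon in counts) / len(counts)
    # sum over locations of n_i * (r_i - r_cm)^2 / N
    sum_dist2 = sum(
        n * math.hypot(lat - center_lat, lon - center_lon) ** 2 / total
        for (lat, lon), n in counts.items()
    )
    return math.sqrt(sum_dist2)


# Two locations two units apart, each visited once.
print(radius_of_gyration([(0.0, 0.0), (2.0, 0.0)]))
```

Running a handful of users through a reference like this and comparing against the Spark output is a cheap way to confirm that a caching or repartitioning change has not altered the results.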
I am open to any suggestions on how to speed up the code. Many thanks.

Pinescript duplicate alerts

I have created a very basic script in pinescript.
study(title='Renko Strat w/ Alerts', shorttitle='S_EURUSD_5_[MakisMooz]', overlay=true)
rc = close
buy_entry = rc[0] > rc[2]
sell_entry = rc[0] < rc[2]
alertcondition(buy_entry, title='BUY')
alertcondition(sell_entry, title='SELL')
plot(buy_entry/10)
The problem is that I get a lot of duplicate alerts. I want to edit this script so that I only get a 'Buy' alert when the previous alert was a 'Sell' alert and vice versa. It seems like such a simple problem, but I have a hard time finding good sources to learn pinescript. So, any help would be appreciated. :)
One way to avoid duplicate alerts within a candle is to use the "Once Per Bar Close" alert option. But for alternating alerts (Buy - Sell) you have to code different logic.
I suggest using version 3 (the version is shown above the study line) rather than versions 1 and 2; you can get the result with this logic:
buy_entry = 0.0
sell_entry = 0.0
buy_entry := rc[0] > rc[2] and sell_entry[1] == 0? 2.0 : sell_entry[1] > 0 ? 0.0 : buy_entry[1]
sell_entry := rc[0] < rc[2] and buy_entry[1] == 0 ? 2.0 : buy_entry[1] > 0 ? 0.0 : sell_entry[1]
alertcondition(crossover(buy_entry ,1) , title='BUY' )
alertcondition(crossover(sell_entry ,1), title='SELL')
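The behavior being asked for, where a signal fires only if the previous alert was the opposite one, can also be sketched outside Pine Script. The Python below is just an illustration of that state-tracking idea, not a translation of the Pine code above; the function name and the sample closes are hypothetical.

```python
def alternating_alerts(closes, lookback=2):
    """Return (bar_index, signal) pairs, suppressing repeats of the last alert.

    A raw BUY is close > close[lookback] bars ago and a raw SELL is the
    reverse, as in the original script. An alert is emitted only when it
    differs from the previously emitted one.
    """
    alerts = []
    last = None
    for i in range(lookback, len(closes)):
        if closes[i] > closes[i - lookback]:
            signal = "BUY"
        elif closes[i] < closes[i - lookback]:
            signal = "SELL"
        else:
            signal = None
        if signal is not None and signal != last:
            alerts.append((i, signal))
            last = signal
    return alerts


closes = [1, 2, 3, 4, 5, 4, 3, 2, 3, 4]
print(alternating_alerts(closes))
# → [(2, 'BUY'), (6, 'SELL'), (9, 'BUY')]
```

The raw BUY condition is true on bars 2, 3 and 4, but only the first is reported, because `last` remembers which side fired most recently.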
You'll have to do it this way
if("Your buy condition here")
strategy.entry("Buy Alert",true,1)
if("Your sell condition here")
strategy.entry("Sell Alert",false,1)
This is a very basic form of it but it works.
You were getting duplicate alerts because the conditions were being fulfilled repeatedly. With strategy.entry(), this won't happen.
When the sell is triggered, as per paper trading, the quantity sold will be double (one to close the long position and one to open a short position).
PS: You will have to add code to create alerts, and put this in strategy() rather than study().
The simplest solution to this problem is to use the built-in crossover and crossunder functions.
They consider the entire series of in-this-case close values, only returning true the moment they cross rather than every single time a close is lower than the close two candles ago.
//@version=5
indicator(title='Renko Strat w/ Alerts', shorttitle='S_EURUSD_5_[MakisMooz]', overlay=true)
c = close
bool buy_entry = false
bool sell_entry = false
if ta.crossover(c[1], c[3])
    buy_entry := true
    alert('BUY')
if ta.crossunder(c[1], c[3])
    sell_entry := true
    alert('SELL')
plotchar(buy_entry, title='BUY', char='B', location=location.belowbar, color=color.green, offset=-1)
plotchar(sell_entry, title='SELL', char='S', location=location.abovebar, color=color.red, offset=-1)
It's important to note why I have changed the indices to 1 and 3 with an offset of -1 in the plotchar function. This will give the exact same signals as 0 and 2 with no offset.
The difference is that you will only see the character print on the chart when the candle actually closes rather than watch it flicker on and off the chart as the close price of the incomplete candle moves.

Error in setting max features parameter in Isolation Forest algorithm using sklearn

I'm trying to train a dataset with 357 features using Isolation Forest sklearn implementation. I can successfully train and get results when the max features variable is set to 1.0 (the default value).
However when max features is set to 2, it gives the following error:
ValueError: Number of features of the model must match the input.
Model n_features is 2 and input n_features is 357
It also gives the same error when the feature count is 1 (int) and not 1.0 (float).
My understanding was that when max features is 2 (int), two features should be considered when building each tree. Is this wrong? How can I change the max features parameter?
The code is as follows:
from sklearn.ensemble.iforest import IsolationForest
def isolation_forest_imp(dataset):
    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0
    estimator = IsolationForest(n_estimators=estimators, max_samples=samples, contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap, random_state=random_state, verbose=verbosity)
    model = estimator.fit(dataset)
In the documentation it states:
max_features : int or float, optional (default=1.0)
The number of features to draw from X to train each base estimator.
- If int, then draw `max_features` features.
- If float, then draw `max_features * X.shape[1]` features.
So, 2 should mean take two features and 1.0 should mean take all of the features, 0.5 take half and so on, from what I understand.
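That reading of the documentation can be written down as a small helper. This is a sketch of the documented rule only, not scikit-learn's internal code; the function name is hypothetical, and truncating the float case with `int()` is my assumption about the rounding.

```python
def resolve_max_features(max_features, n_features):
    """Number of features drawn per base estimator, per the documented rule.

    int   -> draw exactly `max_features` features
    float -> draw `max_features * n_features` features (truncated here)
    """
    if isinstance(max_features, int):
        if not 1 <= max_features <= n_features:
            raise ValueError("max_features must be in [1, n_features]")
        return max_features
    if not 0.0 < max_features <= 1.0:
        raise ValueError("max_features must be in (0.0, 1.0]")
    return int(max_features * n_features)


# With the 357-feature dataset from the question:
print(resolve_max_features(2, 357))
print(resolve_max_features(1.0, 357))
print(resolve_max_features(0.5, 357))
```

Under this rule the integers 1 and 2 are both valid settings, so the ValueError in the question does not come from a misuse of the parameter.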
I think this could be a bug, since, taking a look in IsolationForest's fit:
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
max_depth=max_depth,
sample_weight=sample_weight)
# however, when after _fit the decision_function is called using X - the whole sample - not taking into account the max_features
self.threshold_ = -sp.stats.scoreatpercentile(
-self.decision_function(X), 100. * (1. - self.contamination))
then:
# when the decision function _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)
...
# from tree.py:
def _validate_X_predict(self, X, check_input):
"""Validate X whenever one tries to predict, apply, predict_proba"""
if self.tree_ is None:
raise NotFittedError("Estimator not fitted, "
"call `fit` before exploiting the model.")
if check_input:
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
if issparse(X) and (X.indices.dtype != np.intc or
X.indptr.dtype != np.intc):
raise ValueError("No support for np.int64 index based "
"sparse matrices")
# so, this check fails because X is the original X, not with the max_features applied
n_features = X.shape[1]
if self.n_features_ != n_features:
raise ValueError("Number of features of the model must "
"match the input. Model n_features is %s and "
"input n_features is %s "
% (self.n_features_, n_features))
return X
So, I am not sure on how you can handle this. Maybe figure out the percentage that leads to just the two features you need - even though I am not sure it'll work as expected.
Note: I am using scikit-learn v.0.18
Edit: as @Vivek Kumar commented, this is an issue, and upgrading to 0.20 should do the trick.
