Holt-Winters model fit values - forecast

How do I get the model's fitted values for the training period with the Holt-Winters method on a univariate time series? For the future forecast I can use the following syntax, but I am not sure what the syntax is for the training period.
```python
result = model.fit()
start = len(df)
end = len(df) + 6

# Predictions against the test set
fcast = result.predict(start=start, end=end)
```

I got the answer; it is:
```python
result.fittedvalues
```
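For context, a minimal end-to-end sketch, assuming `df` is a univariate pandas Series with a DatetimeIndex and using statsmodels' `ExponentialSmoothing` (the trend/seasonal settings below are placeholders, not taken from the question):

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit Holt-Winters on the full training series (placeholder settings)
model = ExponentialSmoothing(df, trend="add", seasonal="add", seasonal_periods=12)
result = model.fit()

# In-sample fitted values for the training period, aligned with the training index
fitted = result.fittedvalues

# Out-of-sample forecast for the next 7 periods
fcast = result.forecast(7)
```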

Related

The most efficient way for inference (zero-shot classification, Hugging Face) on CPU

I have a pretty large dataset (200k records), which consists of 2 columns:
- Text
- Labels for prediction
What I want to do is apply a pretrained RoBERTa model for zero-shot classification. Here is how I did it:
```python
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# convert pandas to dataset:
dataset = Dataset.from_pandas(data)

# load model and tokenizer:
model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer, framework='pt')

hypothesis_template = "Im Text geht es um {}"

# define prediction function and apply it to the dataset
def prediction(record, classifier):
    hypothesis_template = "Im Text geht es um {}"
    output = classifier(record['text'], record['label'], hypothesis_template=hypothesis_template)
    record['prediction'] = output['labels'][0]
    record['scores'] = output['scores'][0]
    return record

dataset = dataset.map(lambda x: prediction(x, classifier=classifier))
```
But I am not sure this is the most efficient way to do inference.
The official pipelines page (https://huggingface.co/docs/transformers/main_classes/pipelines) says that I should avoid batching if I am using a CPU. Still, my questions are:
- Is the pipeline wrapper fast enough, or should I stick to something more 'low level' (like native PyTorch)?
- Is inference through .map considered good practice? If not, what should be used instead?
- Given relatively short texts (5-6 words at most), should batching be used instead of processing one record at a time?
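If batching is worth trying despite the CPU caveat, the pipeline can also consume a dataset column directly and batch internally. A hedged sketch, reusing `classifier`, `dataset`, and `hypothesis_template` from above, and assuming a single fixed candidate-label list (the labels below are placeholders; in the original code the labels come from each record):

```python
from transformers.pipelines.pt_utils import KeyDataset

candidate_labels = ["Politik", "Wirtschaft", "Sport"]  # placeholder labels, not from the original data

# Streaming the 'text' column lets the pipeline batch internally;
# batch_size should be tuned empirically, since on CPU batching can also hurt throughput.
predictions = []
for output in classifier(KeyDataset(dataset, "text"),
                         candidate_labels=candidate_labels,
                         hypothesis_template=hypothesis_template,
                         batch_size=8):
    predictions.append((output["labels"][0], output["scores"][0]))
```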

I can't catch the trend in my model and forecast it

I work for a call center and I need to forecast call volumes.
In order to do so, I followed these steps:
- Filling missing values with linear interpolation
- Dividing my data into trend + residual + seasonality
- Making my data stationary using statsmodels
```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from atspy import AutomatedModel

# read dataset
df = pd.read_excel("df.xlsx", index_col=0)

# define daily frequency
df = df.asfreq('D')

# replace missing data with linear interpolation
df.interpolate(method='linear', inplace=True)

# decompose the time series using seasonal_decompose from statsmodels
result = seasonal_decompose(df.value, model='mult')

# remove seasonality and trend to make the data stationary
df_non_seasonal = df.value / (result.seasonal * result.trend)

# make forecasts with Prophet
model_list = ["Prophet"]
model = AutomatedModel(df=df_non_seasonal, model_list=model_list)
```
My model works well for forecasting call volumes; my problem is that the trend is very hard to forecast. It seems unpredictable, and I can't find a way to capture it.
Do you have any advice on processing the trend that would make it easier to predict?

Performance issue when finding/assigning the similarity value between sentences within two dataframes

I am trying to calculate the similarity value between lists of strings using spaCy word2vec, but the code is taking so much time that Google Colab eventually stops working.
The code I came up with is shown below. I have two dataframes: the first contains a list of comments (more than 1.5 million), while the second contains a set of LDA topics represented as a topic name and keywords (39 topics). What is required is to create a new column (within the first dataframe) holding the similarity value between the comments and each topic's keywords (i.e. 39 columns to be added to the first dataframe, each representing the similarity values between the comments and one topic).
I ran the code on a small dataset and it worked fine. However, for the 1.5M comments and 39 topic keyword sets, it ran for more than 2.5 hours and then stopped. I am not sure whether this is the optimal code for the task; any advice is appreciated.
The code is:
```python
for index, row in Post_sent_df.iterrows():  # first dataframe
    row = Post_sent_df['Sent_text'][index]
    doc1 = nlp2(row)
    if doc1.vector_norm:
        for index_tp, row_tp in topics_words_df.iterrows():  # second dataframe
            row_tp = topics_words_df['TopicKeyWords'][index_tp]
            doc2 = nlp2(row_tp)
            if doc2.vector_norm:
                sim_value = doc1.similarity(doc2)
                col_name = str(index_tp)
                Post_sent_df.at[index, index_tp] = sim_value
```
As gojomo mentioned in his comments, most of the time is spent running the nlp2() function without a real need for its full processing, and since I just want to calculate the similarity between word vectors, I decided to use nlp2() through an apply function to compute the vector for each comment, do the same for the topics, and then loop over the generated vectors to calculate the cosine similarity manually. Below is the code I used:
```python
import numpy as np

# Define a function to get the word2vec vector for a sentence
def get_vec(x):
    doc = nlp2(x)
    vec = doc.vector
    return vec

# calculate vectors for the topic keywords
topics_words_df['key_words_vec'] = topics_words_df['TopicKeyWords'].apply(lambda x: get_vec(x))

# calculate vectors for the comments
Post_sent_df['Sent_vec'] = Post_sent_df['Sent_text'].apply(lambda x: get_vec(x))

# calculate cosine similarity
for index, row in Post_sent_df.iterrows():
    row = Post_sent_df['Sent_vec'][index]
    for index_tp, row_tp in topics_words_df.iterrows():
        row_tp = topics_words_df['key_words_vec'][index_tp]
        cosine_similarity = np.dot(row, row_tp) / (np.linalg.norm(row) * np.linalg.norm(row_tp))
        col_name = str(index_tp)
        Post_sent_df.at[index, index_tp] = cosine_similarity
```
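The remaining double loop can itself be replaced by one matrix multiplication, since the per-row vectors are already precomputed. A hedged sketch, reusing the `Sent_vec` and `key_words_vec` columns from above (the zero-vector guard is an assumption, and stacking 1.5M vectors needs a few GB of RAM):

```python
import numpy as np

# Stack the precomputed vectors into matrices: (n_comments, dim) and (n_topics, dim)
S = np.vstack(Post_sent_df['Sent_vec'].to_numpy())
T = np.vstack(topics_words_df['key_words_vec'].to_numpy())

# L2-normalize the rows (guarding against all-zero vectors) so a dot product gives cosine similarity
S_norm = S / np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)
T_norm = T / np.maximum(np.linalg.norm(T, axis=1, keepdims=True), 1e-12)

# (n_comments, n_topics) matrix of cosine similarities, one column per topic
sims = S_norm @ T_norm.T
for j, topic_index in enumerate(topics_words_df.index):
    Post_sent_df[topic_index] = sims[:, j]
```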

How do I calculate hourly median values for one representative year from a 29-year dataset?

From a long-term hourly dataset, I want to get median values for each hour of one representative year. For example: the median value of the first hour of January 1st for the representative year is calculated from the first hour of January 1st of every year in the dataset. The dataset is available here: https://github.com/sugarello/sugarello/blob/master/dfsolarbwdlz.csv
After trying rolling() and groupby(), I ended up creating new data frames by defining criteria for the index.
So far I tried:
```python
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

dfsolar = pd.read_csv('dfsolarbwdlz.csv', delimiter=';')
dfsolar['MESS_DATUM'] = pd.to_datetime(dfsolar['MESS_DATUM'], format='%Y%m%d%H')

# use the timestamp as the index and drop the original column
dfsolar.index = dfsolar['MESS_DATUM']
dfsolarr = dfsolar.drop(columns=["MESS_DATUM"])
```
By defining criteria for month, day and hour, I partially get the data I am looking for. However, it is not practical at all, because I would have to repeat it 8760 times. For example, for the 13th hour of January 1st only:
```python
dfsolarWI00 = dfsolarr[(dfsolarr.index.month == 1) & (dfsolarr.index.day == 1) & (dfsolarr.index.hour == 13)]
```
I assume one solution involves sort_index()/sort(); however, I wasn't able to set up an adequate search algorithm.
Am I on the right track? What is an elegant solution to my problem?
After looking deeper into the options of the groupby method, I reordered the grouping as follows:
```python
dfsolarrtest = dfsolarr.groupby([dfsolarr.index.month, dfsolarr.index.day, dfsolarr.index.hour]).median()
dfsolarrtest.plot(figsize=(80, 40))
```
and plotted the result.
If I am not mistaken, I found my solution by reordering the groupby conditions based on the changing parts of the dates in my given format. However:
I generated a dataset consisting of 8784 rows, which definitely does not equal 8760 hours. Also, single median values computed with:
```python
median_example = dfsolarr[(dfsolarr.index.month == 1) & (dfsolarr.index.hour == 16)]
median_example.median()
```
are not equal to the values for the same hour in the dataset calculated with groupby.
Any help?
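The 8784 rows most likely come from leap years: grouping by (month, day, hour) also creates groups for February 29, and 366 × 24 = 8784. (The single-value check above also pools all January days at hour 16, since it does not filter on the day.) A hedged sketch that drops the leap day before grouping, reusing the dfsolarr frame from above:

```python
# Exclude February 29 so the representative year has 365 * 24 = 8760 hourly medians
mask = ~((dfsolarr.index.month == 2) & (dfsolarr.index.day == 29))
dfsolar_noleap = dfsolarr[mask]

dfsolar_median_year = dfsolar_noleap.groupby(
    [dfsolar_noleap.index.month, dfsolar_noleap.index.day, dfsolar_noleap.index.hour]
).median()

print(len(dfsolar_median_year))  # expected: 8760
```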

How do I break up high-cpu requests on Google App Engine?

To give an example of the kind of request that I can't figure out what else to do for:
The application is a bowling score/stat tracker. When someone enters their scores in advanced mode, a number of stats are calculated, as well as their score. The data is modeled as:
Game - members like name, user, reference to the bowling alley, score
Frame - pinfalls for each ball, boolean lists for which pins were knocked down on each ball, information about the path of the ball (stance, target, where it actually went), the score as of that frame, etc
GameStats - stores calculated statistics for the entire game, to be merged with other game stats as needed for statistics display across groups of games.
An example of this information in practice can be found here.
When a game is complete, and a frame is updated, I have to update the game, the frame, every frame after it and possibly some before it (to make sure their scores are correct), and the stats. This operation always flags the CPU monitor. Even if the game isn't complete, and statistics don't need to be calculated, the scores and such need to be updated to show the real-time progress to the user, and so these also get flagged. The average CPU time for this handler is over 7000 mcycles, and it doesn't even display a view. Most people bowl 3 to 4 games per series - if they are entering their scores realtime, at the lanes, that's about 1 request every 2 to 4 minutes, but if they write it all down and enter it later, there are 30-40 of these requests being made in a row.
As requested, the data model for the important classes:
```python
class Stats(db.Model):
    version = db.IntegerProperty(default=1)

    first_balls = db.IntegerProperty(default=0)
    pocket_tracked = db.IntegerProperty(default=0)
    pocket = db.IntegerProperty(default=0)
    strike = db.IntegerProperty(default=0)
    carry = db.IntegerProperty(default=0)
    double = db.IntegerProperty(default=0)
    double_tries = db.IntegerProperty(default=0)
    target_hit = db.IntegerProperty(default=0)
    target_missed_left = db.IntegerProperty(default=0)
    target_missed_right = db.IntegerProperty(default=0)
    target_missed = db.FloatProperty(default=0.0)
    first_count = db.IntegerProperty(default=0)
    first_count_miss = db.IntegerProperty(default=0)

    second_balls = db.IntegerProperty(default=0)
    spare = db.IntegerProperty(default=0)
    single = db.IntegerProperty(default=0)
    single_made = db.IntegerProperty(default=0)
    multi = db.IntegerProperty(default=0)
    multi_made = db.IntegerProperty(default=0)
    split = db.IntegerProperty(default=0)
    split_made = db.IntegerProperty(default=0)


class Game(db.Model):
    version = db.IntegerProperty(default=3)
    user = db.UserProperty(required=True)
    series = db.ReferenceProperty(Series)
    score = db.IntegerProperty()
    game_number = db.IntegerProperty()
    pair = db.StringProperty()
    notes = db.TextProperty()
    simple_entry_mode = db.BooleanProperty(default=False)
    stats = db.ReferenceProperty(Stats)
    complete = db.BooleanProperty(default=False)


class Frame(db.Model):
    version = db.IntegerProperty(default=1)
    user = db.UserProperty()
    game = db.ReferenceProperty(Game, required=True)
    frame_number = db.IntegerProperty(required=True)
    first_count = db.IntegerProperty(required=True)
    second_count = db.IntegerProperty()
    total_count = db.IntegerProperty()
    score = db.IntegerProperty()
    ball = db.ReferenceProperty(Ball)
    stance = db.FloatProperty()
    target = db.FloatProperty()
    actual = db.FloatProperty()
    slide = db.FloatProperty()
    breakpoint = db.FloatProperty()
    pocket = db.BooleanProperty()
    pocket_type = db.StringProperty()
    notes = db.TextProperty()
    first_pinfall = db.ListProperty(bool)
    second_pinfall = db.ListProperty(bool)
    split = db.BooleanProperty(default=False)
```
A few suggestions:
You could store the stats for frames as part of the same entity as the game, rather than having a separate entity for each, by storing them as a list of bitfields (stored in integers) for the pins standing at the end of each half-frame, for example; a rough sketch of that encoding is shown below. Let me know if you want more details on how this would be implemented.
Failing that, you can calculate some of the more interrelated stats on fetch. For example, calculating the score-so-far ought to be simple if you have the whole game loaded at once, which means you can avoid having to update multiple frames on every request.
We can be of more help if you show us your data model. :)
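To illustrate the bitfield idea, here is a rough, hypothetical sketch of packing the ten pins standing after a half-frame into a single integer; the helper names are made up for illustration and are not part of the App Engine API:

```python
def pins_to_bitfield(pins_standing):
    """Pack a list of 10 booleans (True = pin standing) into one integer."""
    value = 0
    for i, standing in enumerate(pins_standing):
        if standing:
            value |= 1 << i
    return value

def bitfield_to_pins(value, num_pins=10):
    """Unpack the integer back into a list of booleans."""
    return [bool(value & (1 << i)) for i in range(num_pins)]

# Example: pins 7 and 10 left standing after the first ball
first_ball = pins_to_bitfield([False] * 6 + [True, False, False, True])
assert bitfield_to_pins(first_ball)[6] and bitfield_to_pins(first_ball)[9]

# The game's half-frames could then live on the Game entity as something like
# db.ListProperty(int) instead of one Frame entity per frame.
```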
