How do I break up high-cpu requests on Google App Engine? - performance

To give an example of the kind of request that I can't figure out what else to do about:
The application is a bowling score/stat tracker. When someone enters their scores in advanced mode, a number of stats are calculated, as well as their score. The data is modeled as:
Game - members like name, user, reference to the bowling alley, score
Frame - pinfalls for each ball, boolean lists for which pins were knocked down on each ball, information about the path of the ball (stance, target, where it actually went), the score as of that frame, etc
GameStats - stores calculated statistics for the entire game, to be merged with other game stats as needed for statistics display across groups of games.
An example of this information in practice can be found here.
When a game is complete and a frame is updated, I have to update the game, the frame, every frame after it, and possibly some before it (to make sure their scores are correct), and the stats. This operation always flags the CPU monitor. Even if the game isn't complete and statistics don't need to be calculated, the scores and such need to be updated to show real-time progress to the user, so these requests also get flagged. The average CPU time for this handler is over 7000 megacycles, and it doesn't even display a view. Most people bowl 3 to 4 games per series; if they are entering their scores in real time, at the lanes, that's about one request every 2 to 4 minutes, but if they write it all down and enter it later, there are 30-40 of these requests being made in a row.
As requested, the data model for the important classes:
class Stats(db.Model):
    version = db.IntegerProperty(default=1)
    first_balls = db.IntegerProperty(default=0)
    pocket_tracked = db.IntegerProperty(default=0)
    pocket = db.IntegerProperty(default=0)
    strike = db.IntegerProperty(default=0)
    carry = db.IntegerProperty(default=0)
    double = db.IntegerProperty(default=0)
    double_tries = db.IntegerProperty(default=0)
    target_hit = db.IntegerProperty(default=0)
    target_missed_left = db.IntegerProperty(default=0)
    target_missed_right = db.IntegerProperty(default=0)
    target_missed = db.FloatProperty(default=0.0)
    first_count = db.IntegerProperty(default=0)
    first_count_miss = db.IntegerProperty(default=0)
    second_balls = db.IntegerProperty(default=0)
    spare = db.IntegerProperty(default=0)
    single = db.IntegerProperty(default=0)
    single_made = db.IntegerProperty(default=0)
    multi = db.IntegerProperty(default=0)
    multi_made = db.IntegerProperty(default=0)
    split = db.IntegerProperty(default=0)
    split_made = db.IntegerProperty(default=0)

class Game(db.Model):
    version = db.IntegerProperty(default=3)
    user = db.UserProperty(required=True)
    series = db.ReferenceProperty(Series)
    score = db.IntegerProperty()
    game_number = db.IntegerProperty()
    pair = db.StringProperty()
    notes = db.TextProperty()
    simple_entry_mode = db.BooleanProperty(default=False)
    stats = db.ReferenceProperty(Stats)
    complete = db.BooleanProperty(default=False)

class Frame(db.Model):
    version = db.IntegerProperty(default=1)
    user = db.UserProperty()
    game = db.ReferenceProperty(Game, required=True)
    frame_number = db.IntegerProperty(required=True)
    first_count = db.IntegerProperty(required=True)
    second_count = db.IntegerProperty()
    total_count = db.IntegerProperty()
    score = db.IntegerProperty()
    ball = db.ReferenceProperty(Ball)
    stance = db.FloatProperty()
    target = db.FloatProperty()
    actual = db.FloatProperty()
    slide = db.FloatProperty()
    breakpoint = db.FloatProperty()
    pocket = db.BooleanProperty()
    pocket_type = db.StringProperty()
    notes = db.TextProperty()
    first_pinfall = db.ListProperty(bool)
    second_pinfall = db.ListProperty(bool)
    split = db.BooleanProperty(default=False)

A few suggestions:
You could store the stats for frames as part of the same entity as the game, rather than having a separate entity for each, by storing them as a list of bitfields (stored in integers) for the pins standing at the end of each half-frame, for example (see the sketch below). Let me know if you want more details on how this would be implemented.
Failing that, you can calculate some of the more interrelated stats on fetch. For example, calculating the score-so-far ought to be simple if you have the whole game loaded at once, which means you can avoid having to update multiple frames on every request.
We can be of more help if you show us your data model. :)
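To make the first suggestion concrete, here is a minimal sketch using the same google.appengine.ext.db API as the models above. The CompactGame model and the helper names are illustrative, not part of the original code: each half-frame's pinfall is packed into a single integer, so a whole game's pin data lives on one entity and an update touches a single datastore row.

from google.appengine.ext import db

class CompactGame(db.Model):
    # Illustrative replacement for Game plus per-frame entities: one integer
    # bitfield per half-frame, where bit i is set if pin i+1 was knocked down.
    user = db.UserProperty(required=True)
    score = db.IntegerProperty()
    first_pinfall = db.ListProperty(int)   # one entry per frame
    second_pinfall = db.ListProperty(int)  # one entry per frame

def pack_pins(pins):
    # Pack a list of 10 booleans into one integer bitfield.
    bits = 0
    for i, down in enumerate(pins):
        if down:
            bits |= 1 << i
    return bits

def unpack_pins(bits):
    # Expand a bitfield back into a list of 10 booleans.
    return [bool(bits & (1 << i)) for i in range(10)]

def pin_count(bits):
    # Number of pins knocked down in a half-frame.
    return bin(bits).count('1')

With everything on one entity, the second suggestion follows naturally: the score-so-far can be recomputed from the bitfields when the game is fetched, instead of being stored and rewritten on every frame.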

Related

Performance issue when finding/assigning the similarity value between sentences within two dataframes

I am trying to calculate the similarity value between lists of strings using spacy word2vec, but the code is taking too much time, and Google Colab stops working at the end.
The code I came up with is below. I have two dataframes: the first includes a list of comments (more than 1.5 million), while the second includes a set of LDA topics represented as a topic name and keywords (39 topics). What is required is to create new columns within the first dataframe holding the similarity value between each comment and each topic's keywords (i.e. 39 columns to be added to the first dataframe, each one representing the similarity values between the comments and one topic).
I ran the code on a small data set and it worked fine. However, for the 1.5M comments and 39 topics' keywords, it ran for more than 2.5 hours and then stopped. I am not sure this is the optimal code for the task; any advice is appreciated.
The code is:
for index, row in Post_sent_df.iterrows():  # first dataframe
    row = Post_sent_df['Sent_text'][index]
    doc1 = nlp2(row)
    if doc1.vector_norm:
        for index_tp, row_tp in topics_words_df.iterrows():  # second dataframe
            row_tp = topics_words_df['TopicKeyWords'][index_tp]
            doc2 = nlp2(row_tp)
            if doc2.vector_norm:
                sim_value = doc1.similarity(doc2)
                col_name = str(index_tp)
                Post_sent_df.at[index, col_name] = sim_value
As gojomo mentioned in his comments, most of the time is spent running the nlp2() function when its full processing isn't needed. Since I just want to calculate the similarity between word2vec vectors, I decided to call nlp2() through an apply function to compute the word2vec for each comment, do the same for the topics, and then loop over the generated vectors to calculate the cosine similarity manually. Below is the code I used:
# Define a function to get the word2vec for a sentence
def get_vec(x):
    doc = nlp2(x)
    vec = doc.vector
    return vec

# calculate vec for keywords
topics_words_df['key_words_vec'] = topics_words_df['TopicKeyWords'].apply(lambda x: get_vec(x))
# calculate vec for comments
Post_sent_df['Sent_vec'] = Post_sent_df['Sent_text'].apply(lambda x: get_vec(x))

# calculate cosine similarity
for index, row in Post_sent_df.iterrows():
    row = Post_sent_df['Sent_vec'][index]
    for index_tp, row_tp in topics_words_df.iterrows():
        row_tp = topics_words_df['key_words_vec'][index_tp]
        cosine_similarity = np.dot(row, row_tp) / (np.linalg.norm(row) * np.linalg.norm(row_tp))
        col_name = str(index_tp)
        Post_sent_df.at[index, col_name] = cosine_similarity
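Since all the vectors are now plain NumPy arrays, the final double loop can itself be replaced by a single matrix multiplication. This is a minimal sketch, assuming the Sent_vec and key_words_vec columns created above and column names built from the topic index as in the original code; comments with a zero vector (the doc1.vector_norm check in the first version) would produce NaN here and may need to be filtered out first.

import numpy as np

# Stack the precomputed vectors: one row per comment and one row per topic.
sent_mat = np.vstack(Post_sent_df['Sent_vec'].to_numpy())           # (n_comments, dim)
topic_mat = np.vstack(topics_words_df['key_words_vec'].to_numpy())  # (n_topics, dim)

# Normalise the rows, then one dot product yields every cosine similarity at once.
sent_norm = sent_mat / np.linalg.norm(sent_mat, axis=1, keepdims=True)
topic_norm = topic_mat / np.linalg.norm(topic_mat, axis=1, keepdims=True)
sims = sent_norm @ topic_norm.T                                      # (n_comments, n_topics)

# Write the 39 similarity columns back, named by topic index as before.
for pos, index_tp in enumerate(topics_words_df.index):
    Post_sent_df[str(index_tp)] = sims[:, pos]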

Is MNL the right model to use when the choice options vary across observations?

In a survey of 100 people, I am asking each person to choose between product A and product B. I ask each person this question 3 times, but each time I present a different set of products. Say, the first time, Person 1 is asked to choose between 'Phone 1' and 'Phone 2', given certain attributes of each phone. The second time the choice is again 'Phone 1' vs. 'Phone 2', but with a different set of attributes for each phone.
A person is presented with three attributes for the two phone alternatives every time the question is asked. So, each time, the attributes of the phones, such as cost, memory, and camera pixels, are presented so that the user can choose which set of attributes is more attractive, Phone 1's or Phone 2's.
Overall, 3*100 = 300 responses; 3 responses per person. Each time, the attributes cost, memory, and camera pixels are presented, and the user is asked to choose the feature set they prefer.
My goal is to analyze how users value features of a phone vs. the cost of the phone.
In this scenario, can I use an MNL, even though each time I asked the person a question I only presented two choices? My understanding is that MNL is used when (a) there are multiple choices and (b) the choice options do not change across observations, i.e. each person is asked to choose between multiple products, say A, B, C, and A, B, C do not change across observations.
In the scenario described above, the two choices varied across the three times the same person was asked the question. If not MNL, should I instead create a binary logit model, given that the user only had to choose between two options each time the question was asked (even though he was asked the question three times)? If I can use binary logit, should I be concerned that the choice set of products changes across observations, or should I let the attributes defined in each of the rows capture the differences in product choices across observations?
I have set up the data as follows (thinking I can do MNL, but maybe I should set it up differently and use another modeling approach?):
I am working on designing and analyzing a similar survey, but mine is related to transportation. I am at the beginner level and still new to the whole concept; however, I will give you some advice and a reference that may be helpful.
First point: I have come across 3 models, as follows, from a useful video on YouTube:
MNL refers to the Multinomial Logit Model. MNL is used with alternative-invariant regressors (for example the salary of the participant in the survey, or his/her gender …).
The Conditional Logit Model is used with alternative-invariant (gender, salary, education level …) and alternative-variant regressors (cost of the product, memory, camera pixels …).
The Mixed Logit Model uses random parameters. It is also used with alternative-invariant (gender, salary, education level …) and alternative-variant regressors (cost of the product, memory, camera pixels …).
Note regarding alternative-invariant and alternative-variant regressors:
The gender of the person participating in the survey will NOT vary between Product A and Product B, so it is an alternative-invariant regressor, while the price of the product could vary between Product A and Product B, so it is called an alternative-variant regressor.
Based on the above, I assume you need to use a conditional logit model or a mixed logit model.
I couldn't find a dedicated function in R for the conditional logit model or the mixed logit model; the same mlogit function is used. Refer to the examples below from the help of the mlogit package:
a pure "multinomial model"
summary(mlogit(mode ~ 0 | income, data = Fish))
a pure "conditional" model
summary(mlogit(mode ~ price + catch, data = Fish))
a "mixed" model
m <- mlogit(mode ~ price+ catch | income, data = Fish)
summary(m)
same model with charter as the reference level
m <- mlogit(mode ~ price+ catch | income, data = Fish, reflevel = "charter")
From the examples above, I think (but am NOT sure) that in the manual of the mlogit package they refer to a mixed model when you use both alternative-invariant and alternative-variant regressors, a conditional model when you have only alternative-variant regressors, and a multinomial model when you have only alternative-invariant regressors.
Second point: there is something called "panel data" when you are asking the same person to choose one product in each choice set. "Same person" here means that in your model you take into consideration the gender, the salary, the education level …, which stay the same for the same person. Check this: https://en.wikipedia.org/wiki/Panel_data
To use panel techniques, please refer to the help of the mlogit package in R. I am quoting the following from it:
"panel only relevant if rpar is not NULL and if the data are repeated observations of the same unit ; if TRUE, the mixed-logit model is estimated using panel techniques"
So, in my understanding, if you want to use the panel techniques you have to use random draws, because panel will be TRUE and rpar will not be NULL.
Moreover, for an example of using panel data, please refer to the example below from "Estimation of multinomial logit models in R: The mlogit Packages" by Yves Croissant:
data("Train", package = "mlogit")
Tr <- mlogit.data(Train, shape = "wide", varying = 4:11, choice = "choice", sep = "_", opposite = c("price", "time", "change", "comfort"), alt.levels=c("A", "B"), id.var ="id")
Train.ml <- mlogit(choice ~ price + time + change + comfort, Tr)
Train.mxlc <- mlogit(choice ~ price + time + change + comfort, Tr, panel = TRUE, rpar = c(time = "cn", change = "n", comfort = "ln"), correlation = TRUE, R = 100, halton = NA)
Train.mxlu <- update(Train.mxlc, correlation = FALSE)
I hope that is useful to you.

protection not working with api.constrains

I want to forbid making a product if there is no "qty_available", but this code is not working.
It works only if I change @api.constrains to @api.onchange('move_lines'), but if I do it with onchange there is still the possibility to save the record.
As api.constrains ignores dotted names, how can I make this work?
class mrp_production(osv.osv):
    _inherit = 'mrp.production'

    @api.constrains('qty_available', 'move_lines.qty_available')
    def move_lines_check(self):
        for line in self.move_lines:
            if line.qty_available < 1:
                raise ValidationError(_('There is not enough raw material, check Quantity on hand'))
UPDATE (goal)
So, again, the goal is to make a warning appear if there are no raw materials to make the product from (we can't manufacture from nothing), and it should be impossible to make the product if there is not enough material.
Please add the constraint below to the mrp.production model to restrict saving a Manufacturing Order if there is not enough raw material for production.
from openerp import api, _
from openerp.exceptions import Warning

@api.one
@api.constrains('move_lines', 'bom_id')
def _check_product_stock_availability(self):
    if self.move_lines:
        for move in self.move_lines:
            qty_available = move.product_id.with_context(location=move.location_id.id).qty_available
            if qty_available < move.product_uom_qty:
                raise Warning(_('There is not enough raw material, check Quantity on hand.'))
    elif self.bom_id:
        factor = self.product_uom._compute_qty(self.product_uom.id, self.product_qty, self.bom_id.product_uom.id)
        result, result2 = self.bom_id._bom_explode(self.bom_id, self.product_id, factor / self.bom_id.product_qty, None, routing_id=self.routing_id.id)
        product_obj = self.env['product.product']
        for line in result:
            qty_available = product_obj.browse(line.get('product_id')).with_context(location=self.location_src_id.id).qty_available
            # qty_available = line.product_id.with_context(location=self.location_src_id.id).qty_available
            if qty_available < line.get('product_qty'):
                raise Warning(_('There is not enough raw material, check Quantity on hand for products in BOM.'))

h2o H2OAutoEncoderEstimator

I was trying to detect outliers using the H2OAutoEncoderEstimator.
Basically I load 4 KPIs from a CSV file.
For each KPI I have 1 month of data.
The data in the CSV file have been created manually and are the same for each KPI.
The following picture shows the trend of the KPIs:
The first black vertical line (x=4000) indicates the end of the training data.
All the other light black vertical lines indicate the data that I use to detect the outliers each time.
As you can see, the data are very regular (I've copied & pasted the first 1000 rows 17 times).
This is what my code does:
Loads the data from a CSV file (one row represents the value of all KPIs at a specific timestamp)
Trains the model using the first 4000 timestamps
Starting from the 4001st timestamp, every 250 timestamps it calls model.anomaly() to detect the outliers in a specific window (250 timestamps)
My questions are:
Is it normal that the errors returned by model.anomaly() increase every time I call it (from 0.1 to 1.8)?
If I call model.train() again, will the training phase be performed from scratch, replacing the existing model, or will the existing model be updated with the new data provided?
This is my python code:
data = loadDataFromCsv()
nOfTimestampsForTraining = 4000
frTrain = h2o.H2OFrame(data[:nOfTimestampsForTraining])
colsName = frTrain.names

model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[5, 4, 3],
                                l1=1e-5,
                                ignore_const_cols=False,
                                autoencoder=True,
                                epochs=100)
# Train on the first 4000 timestamps
model.train(x=colsName, training_frame=frTrain)

# Init indexes
nOfTimestampsForWindows = 250
fromIndex = nOfTimestampsForTraining
toIndex = fromIndex + nOfTimestampsForWindows

# Perform the outlier detection every nOfTimestampsForWindows timestamps
while toIndex <= len(data):
    frTest = h2o.H2OFrame(data[fromIndex:toIndex])
    error = model.anomaly(frTest)
    df = error.as_data_frame()
    print(df)
    print(df.describe())
    # Adjust indexes for the next window
    fromIndex = toIndex
    toIndex = fromIndex + nOfTimestampsForWindows
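Regarding the second question: calling train() on a fresh estimator builds a new model from scratch. As far as I know, H2O's deep-learning estimators (which H2OAutoEncoderEstimator extends) accept a checkpoint argument to resume training from an earlier model with the same hidden-layer structure. The following is a hedged sketch, assuming the model, colsName and data objects from the code above; the frMore frame and the epoch count are illustrative.

import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

# Assumption: `checkpoint` continues training from the earlier model instead of
# starting over; the network structure must match, and the new `epochs` value is
# a cumulative target, so it should exceed what was already trained.
frMore = h2o.H2OFrame(data[4000:8000])   # hypothetical additional training rows
continued = H2OAutoEncoderEstimator(activation="Tanh",
                                    hidden=[5, 4, 3],
                                    l1=1e-5,
                                    ignore_const_cols=False,
                                    autoencoder=True,
                                    epochs=200,
                                    checkpoint=model.model_id)
continued.train(x=colsName, training_frame=frMore)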

Titan remove many vertices has poor performance

Improving performance when deleting multiple vertices?
I have a big graph in Titan (using Rexster and Cassandra), a potentially large portion of which I would like to be able to remove.
I want to delete everything a user "owns" which isn't in the provided list of IDs, and return the list of IDs of what they actually still "own" in the graph. I have tried these two approaches, both of which take very long. I am testing with the worst-case scenario, where list_of_ids_to_keep is empty ([]).
// A naive list-based approach
def delete_multiple(graph, user_id, list_of_ids_to_keep)
{
    u = get_or_create_user(g, user_id)
    user_things = u.outE('owns').inV().collect{it.id}
    removees = user_things - list_of_ids_to_keep
    g.v(*(removees)).remove() // This is the slow part of this approach
    g.commit()
    return (user_things - removees)
}

// Approach using a hash to potentially reduce memory footprint and O-notation, but it is still taking a long time.
def delete_multiple_hash(graph, user_id, list_of_ids_to_keep)
{
    def u = get_or_create_user(g, user_id)
    def user_things = u.outE('owns').inV().collect{it.id}
    def inUserGraph = [:]
    def keepers = []
    list_of_ids_to_keep.each { k_id ->
        inUserGraph[k_id] = true
    }
    user_things.each { k_id ->
        if (!inUserGraph[k_id]) {
            g.v(k_id).remove() // Again: the remove is the slow part
        }
        else {
            keepers << k_id
        }
    }
    g.commit()
    return keepers
}
Is there a way to improve the performance of this using some sort of bulk operation, turning off some checks, doing an async query, or applying some other strategy?
I have already enabled "storage.batch-loading", which I thought might improve performance. I measure the query times to be about 150 seconds, but then nothing is committed, so I believe it is timing out.
The layout of the graph is:
User (-owns->) A Thing (-contains->) A Description (can be shared by multiple "A Thing"s)
Numbers:
User -owns-> ~= 200K "A Thing"s. (g.v(user_id).outE('owns').count() ~= 200K)
A Thing -contains-> ~= 100 "A Description"s (g.v(a_thing_id).outE('contains').count() ~= 100)
A Description <-contains- ~= 100 "A Thing"s (g.v(a_description_id).inE('contains').count() ~= 100)
