h2o H2OAutoEncoderEstimator - h2o

I was trying to detect outliers using the H2OAutoEncoderEstimator.
Basically I load 4 KPIs from a CSV file.
For each KPI I have 1 month of data.
The data in the CSV file has been manually created and are all the same for each KPI
The following picture shows the trend of the KPIs:
The first black vertical line (x=4000) indicates the end of the training data.
All the others light black vertical lines indicate the data that I use to detect the outliers every time.
As you can see data are very regular (I'v copied & pasted first 1000 rows 17 times).
This is what my code does:
Loads the data from a CSV file (1 row represents the value of all kpis in a specific timestamp)
Trains the model using the first 4000 timestamps
Starting from the 4001 timestamp, every 250 Timestamps it calls the function model.anomaly to detect the outliers in a specific window (250 timestamps)
My questions are:
Is it normal that every time that I call the function model.anomaly the errors returned increases every time (from 0.1 to 1.8)?
If I call again model.train, the training phase will be performed from scratch replacing the existing model or it will be updated with the new data provided?
This is my python code:
data = loadDataFromCsv()
nOfTimestampsForTraining = 4000
frTrain = h2o.H2OFrame(data[:nOfTimestampsForTraining])
colsName = frTrain.names
model = H2OAutoEncoderEstimator(activation="Tanh",
hidden=[5,4,3],
l1=1e-5,
ignore_const_cols=False,
autoencoder=True,
epochs=100)
# Init indexes
nOfTimestampsForWindows = 250
fromIndex = nOfTimestampsForWindows
toIndex = fromWindow + nOfTimestampsForWindows
# Perform the outlier detection every nOfTimestampsForWindows TimeStamps
while toIndex <= len(data) :
frTest = h2o.H2OFrame(data[fromWindow:toWindow])
error = model.anomaly(frTest)
df = error.as_data_frame()
print(df)
print(df.describe())
# Adjust indexes for the next window
fromIndex = toIndex
toIndex = fromIndex + nOfTimestampsForWindows

Related

Performance issue when finding/assigning the similarity value between sentences within two dataframes

I am trying to calculate the similarity value between lists of strings using spacy word2vec, but the code is talking so much time, and google colab stops working at the end.
The code I come-up with is mentioned below; mainly I have two dataframes, the first includes a list of comments (more than 1.5 million) while the second includes a set of LDA topics represented as topic name and keywords (39 topics). What is required is to create a new column (within the first dataframe) holding the similarity value between the comments and each of the topics' keywords (i.e. 39 columns to be added to the first dataframe, each one represents the similarity values between the comments and one topic).
I run the code for small data set and it worked fine. However for the 1.5M comments and 39 topics keywords, it for more than 2.5 hours then stops. I am not sure if this is the optimal code to achieve the task, any advise is appreciated.
The code is:
for index, row in Post_sent_df.iterrows(): #first dataframe
row = Post_sent_df['Sent_text'][index]
doc1 = nlp2(row)
if doc1.vector_norm:
for index_tp, row_tp in topics_words_df.iterrows(): #second dataframe
row_tp = topics_words_df['TopicKeyWords'][index_tp]
doc2 = nlp2(row_tp)
if doc2.vector_norm:
sim_value = (doc1.similarity(doc2))
col_name = str(index_tp)
Post_sent_df.at[index , index_tp] = sim_value
As gojomo mentioned in his comments, most of the time is used to run the nlp2() function without a real need for its processing, and as I just want to calculate the similarity between word2vectors, I decided to use nlp() through an apply function to calculate the word2vec for the comments, and do the same for the topics, and then loop through the generated word2vecs to calculate the cosine similarity manually, below is the code I used:
#Define function to get word2vec for a sentence
def get_vec(x):
doc = nlp2(x)
vec = doc.vector
return vec
#calculate vec for keywords
topics_words_df['key_words_vec'] = topics_words_df['TopicKeyWords'].apply(lambda x: get_vec(x))
#calculate vec for comments
Post_sent_df['Sent_vec'] = Post_sent_df['Sent_text'].apply(lambda x: get_vec(x))
# calculate cosine similarity
for index, row in Post_sent_df.iterrows():
row = Post_sent_df['Sent_vec'][index]
for index_tp, row_tp in topics_words_df.iterrows():
row_tp = topics_words_df['key_words_vec'][index_tp]
cosine_similarity = np.dot(row, row_tp)/(np.linalg.norm(row)* np.linalg.norm(row_tp))
col_name = str(index_tp)
Post_sent_df.at[index , index_tp] = cosine_similarity

Selecting top 10 observations for each data type (SAS)

I am trying to select the top 10 exposures for each class of business out of a large data set.
Below is an example of the dataset.
dataset example
If I were to need the top 10 exposures then I would simply sort by exposure descending (as I have done) and use the (obs = 10) command.
However I require the top 10 for each LOB.
Do you know how I could do this in SAS?
Thanks!
I would create a counting dummy variable, counting the number of exposures per lines of business and then delete any observation for which the dummy variable exceeds 10.
This can be done in a single datastep (given that the data is properly sorted) by (ab-)using that SAS code runs top to bottom.
proc sort data = have out=temp; by lob descending exposure; run;
data want(drop=countlob);
retain countlob;
set temp;
by lob;
countlob = countlob + 1;
if first.lob then countlob = 1;
if countlob > 10 then delete;
run;

How can I more efficiently find the height of a table using Python

I am using openpyxl to copy data from an Excel spreadsheet. The data is a table for an inventory database, where each row is an entry in the database. I read the table one row at a time using a for loop. In order to determine the range of the for loop, I wrote a function that examines each cell in the table to find the height of the table.
Code:
def find_max(self, sheet, row, column):
max_row = 0
cell_top = sheet.cell(row = row - 1, column = column)
while cell_top.value != None:
cell = sheet.cell(row = row, column = column)
max = 0
while cell.value != None or sheet.cell(row = row + 1, column = column).value != None:
row += 1
max = max + 1
cell = sheet.cell(row = row, column = column)
if max > max_row:
max_row = max
cell_top = sheet.cell(row = row, column = column + 1)
return max_row
To summarize the function, I move to the next column in the worksheet and then iterate through every cell in that sheet, keeping track of its height until there are no more columns. The catch about this function is that it has to find two empty cells in a row in order to fail the condition. In a previous version I used a similar approach, but only used one column and stopped as soon as I found a blank cell. I had to change it so the program would still run if the user forgot to fill out a column. This function works okay for a small table, but on a table with several hundred entries this makes the program run much slower.
My question is this: What can I do to make this more efficient? I know nesting a while loop like that makes a program take longer but I do not see how to get around it. I have to make the program as foolproof as possible, so I need to check more than one column to stop user errors from failing the program
This is untested, but every time I've used openpyxl, I iterate over all rows like so:
for row in active_worksheet:
do_something_to(row)
so you could count like:
count = 0
for row in active_worksheet:
count += 1
EDIT: This is a better solution: Is it possible to get an Excel document's row count without loading the entire document into memory?
Read-only mode works row-by-row on the source so you probably want to hook it into it. Alternatively, you could pass the cells of the of a worksheet into something like a Pandas matrix which has indices for empty cells.

BIRT report cross tabs: How to calculate and display durations of time?

I have a BIRT report that displays some statistics of calls to a certain line on certain days. Now I have to add a new measeure called "call handling time". The data is collected from a MySQL DB:
TIME_FORMAT(SEC_TO_TIME(some calculations on the duration of calls in seconds),'%i:%s') AS "CHT"
I fail to display the duration in my crosstab in a "mm:ss"-format even when not converting to String. I can display the seconds by not converting them to a time/string but that's not very human readable.
Also I am supposed to add a "grand total" which calculates the average over all days. No problem when using seconds but I have no idea how to do that in a time format.
Which data types/functoins/expressions/settings do I have to use in the query, Data Cube definition and the cross tab cell to make it work?
Time format is not a duration measure, it cannot be summarized or used for an average. A solution is to keep "seconds" as measure in the datacube to compute aggregations, and create a derived measure for display.
In your datacube, select this "seconds" measure and click "add" to create a derived measure. I would use BIRT math functions to build this expression:
BirtMath.round(measure["seconds"]/60)+":"+BirtMath.mod(measure["seconds"],60)
Here are some things to watch out for: seconds are displayed as single digit values (if <10). The "seconds" values this is based on is not an integer, so I needed another round() for the seconds as well, which resulted in seconds sometimes being "60".
So I had to introduce some more JavaScript conditions to display the correct formatting, including not displaying at all if "0:00".
For the "totals" column I used the summary total of the seconds value and did the exact same thing as below.
This is the actual script I ended up using:
if (measure["seconds"] > 0)
{
var seconds = BirtMath.round(BirtMath.mod(measure["seconds"],60));
var minutes = BirtMath.round(measure["seconds"]/60);
if(seconds == 60)
{
seconds = 0;
}
if (seconds < 10)
{
minutes + ":0" + seconds;
}
else
{
minutes + ":" + seconds;
}
}

How do I break up high-cpu requests on Google App Engine?

To give an example of the kind of request that I can't figure out what else to do for:
The application is a bowling score/stat tracker. When someone enters their scores in advanced mode, a number of stats are calculated, as well as their score. The data is modeled as:
Game - members like name, user, reference to the bowling alley, score
Frame - pinfalls for each ball, boolean lists for which pins were knocked down on each ball, information about the path of the ball (stance, target, where it actually went), the score as of that frame, etc
GameStats - stores calculated statistics for the entire game, to be merged with other game stats as needed for statistics display across groups of games.
An example of this information in practice can be found here.
When a game is complete, and a frame is updated, I have to update the game, the frame, every frame after it and possibly some before it (to make sure their scores are correct), and the stats. This operation always flags the CPU monitor. Even if the game isn't complete, and statistics don't need to be calculated, the scores and such need to be updated to show the real-time progress to the user, and so these also get flagged. The average CPU time for this handler is over 7000 mcycles, and it doesn't even display a view. Most people bowl 3 to 4 games per series - if they are entering their scores realtime, at the lanes, that's about 1 request every 2 to 4 minutes, but if they write it all down and enter it later, there are 30-40 of these requests being made in a row.
As requested, the data model for the important classes:
class Stats(db.Model):
version = db.IntegerProperty(default=1)
first_balls=db.IntegerProperty(default=0)
pocket_tracked=db.IntegerProperty(default=0)
pocket=db.IntegerProperty(default=0)
strike=db.IntegerProperty(default=0)
carry=db.IntegerProperty(default=0)
double=db.IntegerProperty(default=0)
double_tries=db.IntegerProperty(default=0)
target_hit=db.IntegerProperty(default=0)
target_missed_left=db.IntegerProperty(default=0)
target_missed_right=db.IntegerProperty(default=0)
target_missed=db.FloatProperty(default=0.0)
first_count=db.IntegerProperty(default=0)
first_count_miss=db.IntegerProperty(default=0)
second_balls=db.IntegerProperty(default=0)
spare=db.IntegerProperty(default=0)
single=db.IntegerProperty(default=0)
single_made=db.IntegerProperty(default=0)
multi=db.IntegerProperty(default=0)
multi_made=db.IntegerProperty(default=0)
split=db.IntegerProperty(default=0)
split_made=db.IntegerProperty(default=0)
class Game(db.Model):
version = db.IntegerProperty(default=3)
user = db.UserProperty(required=True)
series = db.ReferenceProperty(Series)
score = db.IntegerProperty()
game_number = db.IntegerProperty()
pair = db.StringProperty()
notes = db.TextProperty()
simple_entry_mode = db.BooleanProperty(default=False)
stats = db.ReferenceProperty(Stats)
complete = db.BooleanProperty(default=False)
class Frame(db.Model):
version = db.IntegerProperty(default=1)
user = db.UserProperty()
game = db.ReferenceProperty(Game, required=True)
frame_number = db.IntegerProperty(required=True)
first_count = db.IntegerProperty(required=True)
second_count = db.IntegerProperty()
total_count = db.IntegerProperty()
score = db.IntegerProperty()
ball = db.ReferenceProperty(Ball)
stance = db.FloatProperty()
target = db.FloatProperty()
actual = db.FloatProperty()
slide = db.FloatProperty()
breakpoint = db.FloatProperty()
pocket = db.BooleanProperty()
pocket_type = db.StringProperty()
notes = db.TextProperty()
first_pinfall = db.ListProperty(bool)
second_pinfall = db.ListProperty(bool)
split = db.BooleanProperty(default=False)
A few suggestions:
You could store the stats for frames as part of the same entity as the game, rather than having a separate entity for each, by storing it as a list of bitfields (stored in integers) for the pins standing at the end of each half-frame, for example. Let me know if you want more details on how this would be implemented.
Failing that, you can calculate some of the more interrelated stats on fetch. For example, calculating the score-so-far ought to be simple if you have the whole game loaded at once, which means you can avoid having to update multiple frames on every request.
We can be of more help if you show us your data model. :)

Resources