StratifiedShuffleSplit to get train, test, validation data sets - sklearn-pandas

I need to split my data into train, test, and validation sets using StratifiedShuffleSplit: 60% train, 20% test, 20% validation. To get three parts I used StratifiedShuffleSplit twice, and instead of passing a fraction I set test_size to a number of records.
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

s = [['a',1,'c'], ['AA',0,'CC'], ['a',1,'c'], ['AA',1,0], ['a',1,'d'], ['a',0,3],
     ['AA',1,5], ['AA',0,8], ['a',1,7], ['AA',0,8], ['a',0,8], ['AA',0,9]]
df = pd.DataFrame(s, columns=['input', 'Categories', 'labels'])
X = df[['input']]
Y = df['Categories']

ff = StratifiedShuffleSplit(n_splits=1, test_size=2, random_state=8)

# First split: hold out the validation set
for train_idx, test_idx in ff.split(X, Y):
    X_train_temp = X[X.index.isin(train_idx)]
    Y_train_temp = Y[Y.index.isin(train_idx)]
    X_validation = X[X.index.isin(test_idx)]
    Y_validation = Y[Y.index.isin(test_idx)]

# Second split: divide the remainder into train and test
for train_idx, test_idx in ff.split(X_train_temp, Y_train_temp):
    X_train = X_train_temp[X_train_temp.index.isin(train_idx)]
    Y_train = Y_train_temp[Y_train_temp.index.isin(train_idx)]
    X_test = X_train_temp[X_train_temp.index.isin(test_idx)]
    Y_test = Y_train_temp[Y_train_temp.index.isin(test_idx)]
Somehow the row counts of X_train, X_test, and X_validation add up to 11, which does not equal the 12 rows of X. I tried this on a larger dataset and, again, X_train.shape[0] + X_test.shape[0] + X_validation.shape[0] is less than the original X.shape[0]. Do you know what might cause this?
Is there a more elegant way to do this train, test, validation split using StratifiedShuffleSplit?
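A likely cause, for what it's worth: split() yields positional indices, while .index.isin() compares them against the DataFrame's index labels. That happens to work for the first split (X still has its default 0..11 RangeIndex), but X_train_temp keeps its original labels, so in the second split positions and labels no longer line up and rows get silently dropped. A minimal sketch of the same two-step split using .iloc, which selects by position (fractions chosen to approximate the 60/20/20 target):

from sklearn.model_selection import StratifiedShuffleSplit

# First split: hold out ~20% as the validation set
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=8)
temp_idx, val_idx = next(sss.split(X, Y))
X_temp, Y_temp = X.iloc[temp_idx], Y.iloc[temp_idx]
X_validation, Y_validation = X.iloc[val_idx], Y.iloc[val_idx]

# Second split: 25% of the remaining 80% gives ~20% test, ~60% train
sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=8)
train_idx, test_idx = next(sss2.split(X_temp, Y_temp))
X_train, Y_train = X_temp.iloc[train_idx], Y_temp.iloc[train_idx]
X_test, Y_test = X_temp.iloc[test_idx], Y_temp.iloc[test_idx]

# All three parts now account for every row
assert len(X_train) + len(X_test) + len(X_validation) == len(X)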

Related

Performance issue when finding/assigning the similarity value between sentences within two dataframes

I am trying to calculate the similarity value between lists of strings using spaCy word2vec, but the code is taking so much time that Google Colab stops working before it finishes.
The code I came up with is below. Mainly I have two dataframes: the first contains a list of comments (more than 1.5 million), while the second contains a set of LDA topics represented as topic name and keywords (39 topics). What is required is to create new columns (within the first dataframe) holding the similarity value between the comments and each topic's keywords (i.e. 39 columns to be added to the first dataframe, each one representing the similarity values between the comments and one topic).
I ran the code on a small data set and it worked fine. However, for the 1.5M comments and 39 topic keyword sets, it ran for more than 2.5 hours and then stopped. I am not sure if this is the optimal code for the task; any advice is appreciated.
The code is:
for index, row in Post_sent_df.iterrows():  # first dataframe
    row = Post_sent_df['Sent_text'][index]
    doc1 = nlp2(row)
    if doc1.vector_norm:
        for index_tp, row_tp in topics_words_df.iterrows():  # second dataframe
            row_tp = topics_words_df['TopicKeyWords'][index_tp]
            doc2 = nlp2(row_tp)
            if doc2.vector_norm:
                sim_value = doc1.similarity(doc2)
                col_name = str(index_tp)
                Post_sent_df.at[index, index_tp] = sim_value
As gojomo mentioned in his comments, most of the time is spent running the nlp2() function when its full pipeline is not really needed. Since I only want to compare word2vec vectors, I decided to call nlp2() once per row through an apply function to compute the vector for each comment, do the same for the topics, and then loop over the resulting vectors to calculate the cosine similarity manually. Below is the code I used:
# Define a function to get the word2vec vector for a sentence
def get_vec(x):
    doc = nlp2(x)
    return doc.vector

# Calculate vectors for the topic keywords
topics_words_df['key_words_vec'] = topics_words_df['TopicKeyWords'].apply(get_vec)

# Calculate vectors for the comments
Post_sent_df['Sent_vec'] = Post_sent_df['Sent_text'].apply(get_vec)
# Calculate cosine similarity between every comment and every topic
for index, row in Post_sent_df.iterrows():
    row = Post_sent_df['Sent_vec'][index]
    for index_tp, row_tp in topics_words_df.iterrows():
        row_tp = topics_words_df['key_words_vec'][index_tp]
        cosine_similarity = np.dot(row, row_tp) / (np.linalg.norm(row) * np.linalg.norm(row_tp))
        col_name = str(index_tp)
        Post_sent_df.at[index, index_tp] = cosine_similarity
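Since all the vectors are precomputed at this point, the remaining double loop could also be collapsed into a single matrix multiplication. A rough sketch, assuming the column and dataframe names above and that every vector has the same dimensionality and a non-zero norm:

import numpy as np

# Stack the precomputed vectors into matrices: comments (n x d), topics (k x d)
C = np.vstack(Post_sent_df['Sent_vec'].to_numpy())
T = np.vstack(topics_words_df['key_words_vec'].to_numpy())

# L2-normalise the rows so that a dot product equals cosine similarity
C = C / np.linalg.norm(C, axis=1, keepdims=True)
T = T / np.linalg.norm(T, axis=1, keepdims=True)

# (n x k) matrix of cosine similarities, one column per topic
sims = C @ T.T
for j, topic_idx in enumerate(topics_words_df.index):
    Post_sent_df[topic_idx] = sims[:, j]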

Does Tensorflow keep track of train/validation split?

There is a useful utility function in Tensorflow that makes it really simple to load a dataset made of images as a Tensorflow dataset, namely tf.keras.utils.image_dataset_from_directory.
In the linked tutorial, the following operations are performed sequentially to obtain a training and a validation dataset:
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
My question is: does Tensorflow keep track of which images were placed in the training dataset, in order not to accidentally pick the same images for the validation set? Or could there be duplicates?
The validation_split splits the data into two subsets by index order, i.e. the first 80% of the data will be in train_ds and the remaining 20% in val_ds.
=> Yes, duplicates are possible if you use a validation_split value of more than 20% for the validation subset.
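One way to check this directly on your own directory, assuming the file_paths attribute that recent TensorFlow versions attach to the dataset returned by image_dataset_from_directory:

# Compare the file lists of the two splits; an empty intersection means
# no image ended up in both the training and the validation dataset.
train_files = set(train_ds.file_paths)
val_files = set(val_ds.file_paths)
print(len(train_files & val_files), "files appear in both splits")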

h2o H2OAutoEncoderEstimator

I was trying to detect outliers using the H2OAutoEncoderEstimator.
Basically I load 4 KPIs from a CSV file.
For each KPI I have 1 month of data.
The data in the CSV file was created manually and is identical for each KPI.
The following picture shows the trend of the KPIs:
The first black vertical line (x=4000) indicates the end of the training data.
All the other light black vertical lines mark the windows of data in which I detect the outliers each time.
As you can see, the data is very regular (I've copied and pasted the first 1000 rows 17 times).
This is what my code does:
Loads the data from a CSV file (one row contains the values of all KPIs at a specific timestamp)
Trains the model using the first 4000 timestamps
Starting from timestamp 4001, calls model.anomaly every 250 timestamps to detect the outliers in that window (250 timestamps)
My questions are:
Is it normal that the errors returned by model.anomaly keep increasing with every call (from 0.1 to 1.8)?
If I call model.train again, will the training be performed from scratch, replacing the existing model, or will the model be updated with the new data provided?
This is my Python code:
data = loadDataFromCsv()
nOfTimestampsForTraining = 4000
frTrain = h2o.H2OFrame(data[:nOfTimestampsForTraining])
colsName = frTrain.names

model = H2OAutoEncoderEstimator(activation="Tanh",
                                hidden=[5, 4, 3],
                                l1=1e-5,
                                ignore_const_cols=False,
                                autoencoder=True,
                                epochs=100)

# Train on the first 4000 timestamps
model.train(x=colsName, training_frame=frTrain)

# Init indexes
nOfTimestampsForWindows = 250
fromIndex = nOfTimestampsForTraining
toIndex = fromIndex + nOfTimestampsForWindows

# Perform the outlier detection every nOfTimestampsForWindows timestamps
while toIndex <= len(data):
    frTest = h2o.H2OFrame(data[fromIndex:toIndex])
    error = model.anomaly(frTest)
    df = error.as_data_frame()
    print(df)
    print(df.describe())
    # Adjust indexes for the next window
    fromIndex = toIndex
    toIndex = fromIndex + nOfTimestampsForWindows
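On the second question, for what it's worth: as far as I know, calling train() again on the same estimator builds a fresh model rather than updating the existing one. H2O's deep learning estimators do expose a checkpoint parameter for continuing training from a previous model. A rough sketch under that assumption (frNewData is a hypothetical H2OFrame holding the additional rows; the network structure must match the checkpointed model):

# Hedged sketch: continue training from the previous model via `checkpoint`
# instead of starting from scratch.
model2 = H2OAutoEncoderEstimator(activation="Tanh",
                                 hidden=[5, 4, 3],
                                 l1=1e-5,
                                 ignore_const_cols=False,
                                 autoencoder=True,
                                 epochs=200,
                                 checkpoint=model.model_id)
model2.train(x=colsName, training_frame=frNewData)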

Displaying Max, Min, Avg across bar chart Tableau

I have a bar chart with X axis as discrete date value and Y axis as number of records.
eg: x axis (Filtered Date)- 1st Oct, 2nd Oct, 3rd Oct etc
y axis (Number of Records)- 30, 4, 3 etc
Now I have to create a table showing the max, min, and average of 'Number of Records'.
I have written a calculated field as MAX([Number of Records]) to get the maximum of Number of Records, in this case 30, but I always get a value of 1.
How do I define the calculations to get the max, min, and average?
Thanks,
Number of Records is an automatically generated calculated field that Tableau creates when importing a data source. You can right-click on it and see the definition of the calculation: 1.
As you currently have your field defined, Tableau will look for the maximum value of that column. It will always be 1 because that is the only value in the field for every record.
It sounds like you are actually trying to calculate the maximum of the sum of the number of records at your aggregation level (in your case, date). You should be able to accomplish this easily using a table calculation or a Level of Detail (LOD) expression. Something like the following:
WINDOW_MAX(SUM([Number of Records]))
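If the goal is a small table with all three statistics, the matching table calculations for the min and average would presumably be (same field, same table-calculation scope):
WINDOW_MIN(SUM([Number of Records]))
WINDOW_AVG(SUM([Number of Records]))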

R - Sorting and Sub-setting Maximum Values within Columns

I am trying to iteratively sort data within columns to extract N maximum values.
My data is set up with the first and second columns containing occupation titles and codes, and all of the remaining columns containing comparative values (in this case, location quotients previously calculated for each city) for those occupations across various cities:
occ_code   city1   ...   city300
occ1           5   ...         7
occ2          20   ...        22
...
occ800        20   ...        25
For each city I want to sort by the maximum values and select a subset of those maximum values matched to their respective occupation titles and codes. I thought it would be relatively trivial but...
Edit for clarification: I want to end up with a sorted subset of the data for analysis, e.g.:
occ_code city1
occ200 10
occ90 8
occ20 2
occ95 1.5
At the same time I want to be able to repeat the sort column-wise (so I've tried lots of order() calls on columns addressed directly, e.g. data[,2]), just so I can run the same analysis functions over the entire dataset.
I've been messing with plyr for the past 3 days and I feel like the setup of my dataset is just not conducive to how plyr was meant to be used.
I'm not exactly sure what your desired output is according to your example snippet. Here's how you could get a data frame like that for every city using plyr and reshape:
#using the same df from nico's answer
library(reshape)
df.m <- melt(df, id = 1)
a.cities <- cast(df.m, codes ~ . | variable)
library(plyr)
a.cities.max <- aaply(a.cities, 1, function(x) arrange(x, desc(`(all)`))[1:4,])
Now, a.cities.max is an array of data frames, with the 4 largest values for each city in each data frame. To get one of these data frames, you can index it with
a.cities.max$X13
I don't know exactly what you'll be doing with this data, but you might want it back in data frame format.
df.cities.max <- adply(a.cities.max, 1)
One way would be to use order with ddply from the package plyr
> library(plyr)
> d<-data.frame(occu=rep(letters[1:5],2),city=rep(c('A','B'),each=5),val=1:10)
> ddply(d,.(city),function(x) x[order(x$val,decreasing=TRUE)[1:3],])
order can sort on multiple columns if you want that.
This will output the max for each city. Similar results can be obtained using sort or order
# Generate some fake data
codes <- paste("Code", 1:100, sep="")
values <- matrix(0, ncol=20, nrow=100)
for (i in 1:20)
    values[,i] <- sample(0:100, 100, replace=T)
df <- data.frame(codes, values)
names(df) <- c("Code", paste("City", 1:20, sep=""))
# Now for each city we get the maximum
maxval <- apply(df[2:21], 2, which.max)
# Output the max for each city
print(cbind(paste("City", 1:20), codes[maxval]))
