There is a useful utility function in Tensorflow that makes it really simple to load a dataset made of images as a Tensorflow dataset, namely tf.keras.utils.image_dataset_from_directory.
In the tutorial at this page here, the following operations are performed sequentially in order to obtain a training and a validation dataset:
train_ds = tf.keras.utils.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="training",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
My question is: does Tensorflow keep track of which images were placed in the training dataset, in order not to accidentally pick the same images for the validation set? Or could there be duplicates?
The 'validation_split' splits data into 2 classes by order of index. i.e. first 80% data will be in train_ds and remaining 20% data will be in test_ds.
=> Yes, duplicates is possible if you use validation_split value more than 20% in test_ds.
Related
I am running a PySpark application where we are comparing two large datasets of 3GB each. There are some differences in the datasets, which we are filtering via outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key,how='outer').where(condition).select(primary_key)
mismatch_ids_row.count()
So the output of join on count is a small data of say 10 records. The shuffle partition at this point is about 30 which has been counted as amount of data/partition size(100Mb).
After the result of the join, the previous two datasets are joined with the resultant joined datasets to filter out data for each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we are dropping duplicates since the result of first join will be double via outer join where some values are null.
These two dataframes are further joined to find the column level comparison and getting the exact issue where the data is mismatched.
df = (df_1.join(df_2,on=some condition, how="full_outer"))
result_df = df.count()
The resultant dataset is then used to display as:
result_df.show()
The issue is that, the first join with more data is using merge sort join with partition size as 30 which is fine since the dataset is somewhat large.
After the result of the first join has been done, the mismatched rows are only 10 and when joining with 3Gb is a costly operation and using broadcast didn't help.
The major issue in my opinion comes when joining two small resultant datasets in second join to produce the result. Here too many shuffle partitions are killing the performance.
The application is running in client mode as spark run for testing purposes and the parameters are sufficient for it to be running on the driver node.
Here is the DAG for the last operation:
As an example:
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
field = [
StructField("row_num",LongType(),True),
StructField("tripid",IntegerType(),True),
StructField("car_type",StringType(),True),
StructField("dates", StringType(), True),
StructField("pickup_location_id", IntegerType(), True),
StructField("trips", IntegerType(), True)
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences, on a larger dataset think of these as 10 or more differences.
df_1 will get rows from 1st sourceonedf based on mismatch_ids_row and so will the df_2. They are then joined to create another resultant dataframe which outputs the data.
How can we optimize this piece of code so that optimum partitions are there for it to perform faster that it does now.
At this point it takes ~500 secs to do whole activity, when it can take about 200 secs lesser and why does the show() takes time as well, there are only 10 records so it should print pretty fast if all are in 1 partition I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
mismatch_ids_row.alias('a')
.join(mismatch_ids_row.alias('b'), on=some condition...)
.select(...)
)
Need to get train, test, validation dataset using StratifiedShuffleSplit.
Need to get
60% train, 20% test, 20% validation, so used StratifiedShuffleSplit twice to get three parts.
So did not use fraction to use StratifiedShuffleSplit twice, I set number of records in test.
s = [['a',1,'c'], ['AA',0,'CC'],['a',1,'c'],['AA',1,0],['a',1,'d'],['a',0,3],['AA',1,5],['AA',0,8],['a',1,7],['AA',0,8],['a',0,8],['AA',0,9]]
df = pd.DataFrame(s, columns=['input', 'Categories', 'labels'])
X = df[['input']]
Y = df['Categories']
from sklearn.model_selection import StratifiedShuffleSplit
ff = StratifiedShuffleSplit(n_splits=1, test_size=2, random_state=8, )
for train_idx, test_idx in ff.split(X, Y):
X_train_temp=X[X.index.isin(train_idx)]
Y_train_temp=Y[Y.index.isin(train_idx)]
X_validation=X[X.index.isin(test_idx)]
Y_validation=Y[Y.index.isin(test_idx)]
for train_idx, test_idx in ff.split(X_train_temp, Y_train_temp):
X_train=X_train_temp[X_train_temp.index.isin(train_idx)]
Y_train=Y_train_temp[Y_train_temp.index.isin(train_idx)]
X_test=X_train_temp[X_train_temp.index.isin(test_idx)]
Y_test=Y_train_temp[Y_train_temp.index.isin(test_idx)]
Somehow X_train, X_test, X_validation number of rows add up together equals to 11 which does not equal to total row number of X of 12.
Tried this on a larger dataset, somehow
X_train.shape[0]+X_test.shape[0]+X_validation.shape[0].
is less than the original X.shape[0]. Do you know what might cause this?
Is there a more elegent way to do this train, test, validation using StratifiedShuffleSplit?
I work for a call center and i need to forecast call volumes
In order to do so i followed these steps :
-Filling missing values with linear interpolation
-Divided my data into trend+residual+seasonality
-Made my data stationnary using stat model
#read dataset
df= pd.read_excel("df.xlsx", index_col=0)
#define daily frequency
df= df.asfreq(df.index.freq)
#replace missing data with linear intepolation
df.interpolate(method='linear', inplace=True)
#decompose time series using seasonal_decompose from SKlearn
result= seasonal_decompose(df.value, model='mult')
#remove seasonality and trend to make data stationnary
df_non_seasonal=df.value.values/
(result_add.seasonal*result_add.trend)
#Make forecasts with Prophet
from atspy import AutomatedModel
model_list = ["Prophet"]
model = AutomatedModel(df = df_non_seasonal,model_list=model_list)
My model works well to forecast call volumes, my problem is that my trend is very hard to forecast, it seems to be impredictible, i can't find a way to catch it
Do you have any advices for the processing of my trend that would make it easier to predict?
I am trying to calculate the similarity value between lists of strings using spacy word2vec, but the code is talking so much time, and google colab stops working at the end.
The code I come-up with is mentioned below; mainly I have two dataframes, the first includes a list of comments (more than 1.5 million) while the second includes a set of LDA topics represented as topic name and keywords (39 topics). What is required is to create a new column (within the first dataframe) holding the similarity value between the comments and each of the topics' keywords (i.e. 39 columns to be added to the first dataframe, each one represents the similarity values between the comments and one topic).
I run the code for small data set and it worked fine. However for the 1.5M comments and 39 topics keywords, it for more than 2.5 hours then stops. I am not sure if this is the optimal code to achieve the task, any advise is appreciated.
The code is:
for index, row in Post_sent_df.iterrows(): #first dataframe
row = Post_sent_df['Sent_text'][index]
doc1 = nlp2(row)
if doc1.vector_norm:
for index_tp, row_tp in topics_words_df.iterrows(): #second dataframe
row_tp = topics_words_df['TopicKeyWords'][index_tp]
doc2 = nlp2(row_tp)
if doc2.vector_norm:
sim_value = (doc1.similarity(doc2))
col_name = str(index_tp)
Post_sent_df.at[index , index_tp] = sim_value
As gojomo mentioned in his comments, most of the time is used to run the nlp2() function without a real need for its processing, and as I just want to calculate the similarity between word2vectors, I decided to use nlp() through an apply function to calculate the word2vec for the comments, and do the same for the topics, and then loop through the generated word2vecs to calculate the cosine similarity manually, below is the code I used:
#Define function to get word2vec for a sentence
def get_vec(x):
doc = nlp2(x)
vec = doc.vector
return vec
#calculate vec for keywords
topics_words_df['key_words_vec'] = topics_words_df['TopicKeyWords'].apply(lambda x: get_vec(x))
#calculate vec for comments
Post_sent_df['Sent_vec'] = Post_sent_df['Sent_text'].apply(lambda x: get_vec(x))
# calculate cosine similarity
for index, row in Post_sent_df.iterrows():
row = Post_sent_df['Sent_vec'][index]
for index_tp, row_tp in topics_words_df.iterrows():
row_tp = topics_words_df['key_words_vec'][index_tp]
cosine_similarity = np.dot(row, row_tp)/(np.linalg.norm(row)* np.linalg.norm(row_tp))
col_name = str(index_tp)
Post_sent_df.at[index , index_tp] = cosine_similarity
I have successfully used combination of crossfilter, dc, d3 to build multivariate charts for smaller datasets.
My current system caters to 1.5 million txns a day and I want to use the above combination to show dimensional charts on this big sized data (spanned over 6 months). I cannot push this sized data to the frontend for obvious reasons.
The txn data has seconds level granularity but this level of granularity is not required in the visualization. If txn data can be rolled up to a granularity of a day at the backend and push the day based aggregation to the front end then it can drastically reduce the IO traffic and size of the data given to the crossfilter,dc and then dc can show its visualization magic.
Taking forward the above idea -> I decided to reduce the size of the data by reducing the granularity of the timeseries data from millseconds to day by pre-aggregating the data from various dimensions using the below GROUP BY query (this is similar to the stuff done by crossfilter but at the frontend)
SELECT TRUNC(DATELOGGED) AS DTLOGGED, CODE, ACTION, COUNT(*) AS
TXNCOUNT, GROUPING_ID(TRUNC(DATELOGGED),CODE, ACTION) AS grouping_id
FROM AAAA GROUP BY GROUPING SETS(TRUNC(DATELOGGED),
(TRUNC(DATELOGGED),CURR_CODE), (TRUNC(DATELOGGED),ACTION));
Sample output of these rows:
Tuples/Rows in which aggregation is done by (TRUNC(DATELOGGED),CODE) will have a common grouping_id 1 and by (TRUNC(DATELOGGED),ACTION) will have a common grouping_id 2
//group by DTLOGGED, CODE
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":69,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":20,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":254,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":961,"GROUPING_ID":1},
//group by DTLOGGED, ACTION
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":373600,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":48978,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":402311,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":54910,"GROUPING_ID":2},
//group by DTLOGGED
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":460732,"GROUPING_ID":3},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":496060,"GROUPING_ID":3}];
Questions:
These rows are are dis-joined i.e. not like usual rows where each row will have valid values for CODE and ACTION in a single row.
After a selection is made in one of the graphs, the redrawing effect either removes the other graphs or shows no data on them.
Please give me any troubleshooting help or suggest better ways to solve this?
http://jsfiddle.net/universallocalhost/5qJjT/3/
So there are a couple things going on in this question, so I'll try to separate them:
Crossfilter works with tidy data
http://vita.had.co.nz/papers/tidy-data.pdf
This means that you will need to come up with a naive method of filling in the nulls you're seeing (or if need be, in your initial query of the data, omit the nulled values. If you want to get really fancy, you could even infer the null values based off of other data. Whatever your solution, you need to make your data tidy prior to putting it into crossfilter.
Groups and Filtering Operations
txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
if(d.GROUPING_ID ===1) {
return d.TXNCOUNT;
} else {
return 0;
}
});
This is a filtering operation done on the reduction. This is something that you should separate. Allow that filtering to occur elsewhere (either in the visual, crossfilter itself, or in the query on the data).
This means your reduceSum's become:
var txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
return d.TXNCOUNT;
});
And if you would like the user to select which group to display:
var groupId = cfdata.dimension(function(d) { return d.GROUPING_ID; });
var groupIdGroup = groupId.group(); // this is an interesting name
dc.pieChart("#group-chart")
.width(250)
.height(250)
.radius(125)
.innerRadius(50)
.transitionDuration(750)
.dimension(groupId)
.group(groupIdGroup)
.renderLabel(true);
For an example of this working:
http://jsfiddle.net/b67pX/