As part of a large QC benchmark I am creating a large number (approx. 100K) of scatter plots in a single PDF using the PdfPages backend (see further down for the code).
The issue I am having is that the plotting takes too much time; see the output from a custom profiling/debugging effort:
Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
The plotting time increases 2-3x if I annotate the points, which is necessary for the use case. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not change the times significantly as far as I can see.
def annotate(row, ax):
    ax.annotate(row.name, (row.exp, row.model),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')

def plot2File(df, file, seq, z, p, s):
    """ Plot predictions vs experimental """
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)
    # for row in df.itertuples():
    #     ax.annotate(row.Index, (row.exp, row.model),
    #                 xytext=(10, 20), textcoords='offset points',
    #                 arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
    #                 family='sans-serif', fontsize=8, color='darkslategrey')
    plt.savefig(file, bbox_inches='tight', format='pdf')
    plt.close()
Given the nice explanation by Jeff on a question regarding iterrows() I was wondering if it would be possible to vectorize the annotation process? Or should I ditch using a data frame altogether?
I have a list of times in decimal seconds, and I know what time the series started. I would like to convert each value to a time of day with the start time applied as an offset. There must be a simple way to do this that I am missing!
Sample source data:
\Name of source file : 260521-11_58
\Recording from 26.05.2021 11:58
\Channels : 1
\Scan rate : 101 ms = 0.101 sec
\Variable 1: n1(rpm)
\Internal identifier: 63
\Information1:
\Information2:
\Information3:
\Information4:
0.00000 3722.35645
0.10100 3751.06445
0.20200 1868.33350
0.30300 1868.36487
0.40400 3722.39355
0.50500 3722.51831
0.60600 3722.50464
0.70700 3722.32446
0.80800 3722.34277
0.90900 3722.47729
1.01000 3722.74048
1.11100 3722.66650
1.21200 3722.39355
1.31300 3751.02710
1.41400 1868.27539
1.51500 3722.49097
1.61600 3750.93286
1.71700 1868.30334
1.81800 3722.29224
The start date and time is 26.05.2021 11:58, and the left-hand column is elapsed time in seconds, with the column name [Time]. So I just want to convert the decimal/real value to a time or timespan and add the start time to it.
I have tried lots of ways that are really hacky and ultimately flawed. The expression below works, but just ignores the milliseconds.
TimeSpan(0,0,0,Integer(Floor([Time])),[Time] - Integer(Floor([Time])))
The last part works to just get milli / micro seconds on its own, but not as part of the above.
Your formula isn't really ignoring the milliseconds: you are passing the decimal part of your time (a fraction of a second) as if it were a number of milliseconds, so the value being returned is too small to show up in the format mask.
You need to convert the fractional seconds to milliseconds, so something like this should work:
TimeSpan(0,0,0,Integer(Floor([Time])),([Time] - Integer(Floor([Time]))) * 1000)
To add it to the time, this would work
DateAdd(Date("26-May-2021"),TimeSpan(0,0,0,Integer([Time]),([Time] - Integer([Time])) * 1000))
You will need to set the column format to
dd-MMM-yyyy HH:mm:ss:fff
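For readers who want to check the arithmetic outside the expression language, here is a minimal Python sketch of the same idea: split the elapsed seconds into whole seconds and milliseconds, then add them to the start time. The names START and to_time_of_day are made up for illustration, and the start time is taken from the file header (26.05.2021 11:58).

```python
from datetime import datetime, timedelta

# Start of the recording, read from the header line "Recording from 26.05.2021 11:58".
START = datetime(2021, 5, 26, 11, 58)

def to_time_of_day(elapsed_seconds, start=START):
    """Split elapsed seconds into whole seconds and milliseconds, then offset by start."""
    whole = int(elapsed_seconds)                      # 1 for 1.313
    millis = round((elapsed_seconds - whole) * 1000)  # 313 for 1.313
    return start + timedelta(seconds=whole, milliseconds=millis)

# A few elapsed-time values from the sample data.
samples = [0.0, 0.101, 1.313, 1.818]
stamps = [to_time_of_day(t).isoformat(timespec="milliseconds") for t in samples]
for stamp in stamps:
    print(stamp)
```

The fractional part is rounded after multiplying by 1000 because values such as 1.313 are not exactly representable as floats; truncating instead could come out one millisecond short.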
I have the following problem:
I have an autoencoder in Keras, and train it for a few epochs. The training overview shows a validation MAE of 0.0422 and an MSE of 0.0024.
However, if I then call network.predict and manually calculate the validation errors, I get 0.035 and 0.0024.
One would assume that my manual calculation of the MAE is simply incorrect, but the weird thing is that if I use an identity model (simply outputs what you input) and use that to evaluate the predicted values, the same error value is returned as for my manual calculation. The code looks as follows:
input = Input(shape=(X_train.shape[1], ))
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(input)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(encoded)
decoded = Dense(50, activation='relu', activity_regularizer=regularizers.l1(10e-5))(decoded)
decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)
network = Model(input, decoded)
# sgd = SGD(lr=8, decay=1e-6)
# network.compile(loss='mean_squared_error', optimizer='adam')
network.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
# Fitting the data
network.fit(X_train, X_train, epochs=2, batch_size=1, shuffle=True, validation_data=(X_valid, X_valid),
callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=20, verbose=0, mode='auto')])
# Results
recon_valid = network.predict(X_valid, batch_size=1)
score2 = network.evaluate(X_valid, X_valid, batch_size=1, verbose=0)
print('Network evaluate result: mae={}, mse={}'.format(*score2))
x = Input((X_train.shape[1],))
m = Model(x, x)
m.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mse'])
score1 = m.evaluate(recon_valid, X_valid, batch_size=1, verbose=0)
print('Identity evaluate result: mae={}, mse={}'.format(*score1))
errors_test = np.absolute(X_valid - recon_valid)
print("Manual MAE: {}".format(np.average(errors_test)))
errors_test = np.square(X_valid - recon_valid)
print("Manual MSE: {}".format(np.average(errors_test)))
Which outputs the following:
Train on 282 samples, validate on 94 samples
Epoch 1/2
2018-04-18 17:24:01.464947: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
282/282 [==============================] - 0s - loss: 0.0861 - mean_squared_error: 0.0187 - val_loss: 0.0451 - val_mean_squared_error: 0.0025
Epoch 2/2
282/282 [==============================] - 0s - loss: 0.0440 - mean_squared_error: 0.0025 - val_loss: 0.0422 - val_mean_squared_error: 0.0024
Network evaluate result: mae=0.04216482736011769, mse=0.0024067993242382767
Identity evaluate result: mae=0.03506102238563781, mse=0.0024067993242382767
Manual MAE: 0.03506102412939072
Manual MSE: 0.002406799467280507
I know that my manual calculation is correct, since the identity model (m) returns the same value. The only possible explanation for the difference in MAE values would then be if network.evaluate(X_valid, X_valid) somehow uses different values than those returned by network.predict(X_valid), but then the MSE would also be different.
This leaves me completely confused, thinking there might be a bug in the Keras MAE calculation. Has anyone had this issue before or have any ideas how it might be fixed? I am using the Tensorflow backend.
Any help would be much appreciated!
EDIT: I'm almost certain this is a bug. If I keep loss='mae' but also add metrics=['mse', 'mae'], the MAE returned by the metrics is the same as my manual computation and the identity model. The same is true for MSE: if I set loss='mse', the MSE returned by the metric is different from the loss.
It turns out that the loss is supposed to be different from the metric, because of the regularization. With the activity regularizers in place, the loss is higher (in my case), because each regularizer adds a penalty term based on the layer's activations to the loss. The metrics don't take this penalty into account, and therefore return a different value, which equals what one would get when manually computing the error.
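As a rough illustration of why the two numbers diverge, the following pure-Python sketch computes an MAE metric and then a loss that adds an L1 activity penalty on top of it. It is not Keras code: the function names, the toy data and the penalty rate are all assumptions chosen only to show the arithmetic.

```python
def mean_absolute_error(y_true, y_pred):
    """Plain MAE, which is what a metric (or a manual calculation) reports."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def l1_activity_penalty(activations, rate):
    """L1 penalty on layer activations: rate * sum(|a|)."""
    return rate * sum(abs(a) for a in activations)

# Toy values, invented for this example.
y_true = [0.2, 0.5, 0.9]
y_pred = [0.25, 0.45, 0.8]
hidden_activations = [0.0, 1.5, 3.0]

metric = mean_absolute_error(y_true, y_pred)
loss = metric + l1_activity_penalty(hidden_activations, rate=1e-4)

print(f"metric (MAE only): {metric:.6f}")
print(f"loss (MAE + penalty): {loss:.6f}")
```

In the output from the question, the same gap shows up as the difference between the reported loss (about 0.0422) and the manually computed MAE (about 0.0351), while the MSE metric, which carries no penalty, matches exactly.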
The metrics during training and validation differ for several reasons:
The dataset is different.
During training the weights change at every step, so the metrics change too.
The metric during training is that of the current batch of data or a running average of the metrics of the last batches. For the evaluation, the metric is for the whole dataset.
I'm working on Groovy code performance optimization. I've used jvisualvm to connect to the running application and gather CPU samples. The samples say that org.codehaus.groovy.reflection.CachedMethod.invoke takes the most CPU time. I don't see any other application methods in the samples.
What is the right way to dig into CachedMethod.invoke and understand which lines of code really cause the performance penalties?
Thanks.
UPD:
I do use Indy; it didn't help me.
I didn't try to introduce @CompileStatic since I want to find my bottlenecks before rewriting Groovy to Java.
My problem is a bit similar to this thread: Call site caching faster than invokedynamic?
I have code that dynamically composes a Groovy script. The script template looks like this:
def evaluateExpression(Map context){
def user = context.user
%s
}
where %s is replaced with
user.attr1 == '1' || user.attr2 == '2' || user.attr3 == '3'
There is a set of replacements (20 in total) taken from the database.
The code gets replacements from DB, creates GroovyScript and evaluates it.
I suppose the bottleneck is in the script execution. What is the right way to fix it?
So, I've tried various things:
groovy-indy: doesn't work.
groovy-indy with some code "optimization": doesn't work. BTW, I started to play around with try/catch, and as a result I made my "hotspot" run 4 times faster. I'm not good at JVM internals, but the internet says try/catch prevents optimizations. I took that as ground truth; I need to dig deeper to understand how it really works.
I gave up, turned off invokedynamic and rewrote my "hottest" code with @CompileStatic. It took about 3-4 hours and my code runs 100 times faster now.
Here are initial metrics with "invokedynamic support"
count = 83043
mean rate = 395.52 calls/second
1-minute rate = 555.30 calls/second
5-minute rate = 217.78 calls/second
15-minute rate = 82.92 calls/second
min = 0.29 milliseconds
max = 12.98 milliseconds
mean = 1.59 milliseconds
stddev = 1.08 milliseconds
median = 1.39 milliseconds
75% <= 2.46 milliseconds
95% <= 3.14 milliseconds
98% <= 3.44 milliseconds
99% <= 3.76 milliseconds
99.9% <= 12.19 milliseconds
Here are the @CompileStatic metrics with indy turned off. BTW, there is no reason to use @CompileStatic if "indy" is turned on.
count = 139724
mean rate = 8950.43 calls/second
1-minute rate = 2011.54 calls/second
5-minute rate = 426.96 calls/second
15-minute rate = 143.76 calls/second
min = 0.02 milliseconds
max = 24.18 milliseconds
mean = 0.08 milliseconds
stddev = 0.72 milliseconds
median = 0.06 milliseconds
75% <= 0.08 milliseconds
95% <= 0.11 milliseconds
98% <= 0.15 milliseconds
99% <= 0.20 milliseconds
99.9% <= 1.27 milliseconds
I am starting a discussion which, I hope, will become one place to compare loading data using mutators vs. loading from a flat file via 'LOAD DATA INFILE'.
I have been baffled by my inability to get any real performance gain from mutators, whatever the batch size (1000, 10000, 100K, et cetera).
My project involved loading close to 400 million rows of social media data into HyperTable to be used for real-time analytics. It took me close to 3 days to load just 1 million rows of data (code sample below). Each row is approximately 32 bytes. So, in order to avoid taking 2-3 weeks to load this much data, I prepared a flat file with the rows and used the LOAD DATA INFILE method. The performance gain was amazing. Using this method, the loading rate was 368336 cells/sec.
See below for actual snapshot of action:
hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;
Loading 7,113,154,337 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 508.07 s
Avg key size: 8.92 bytes
Total cells: 218976067
Throughput: 430998.80 cells/s
Resends: 2210404
hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;
Loading 12,693,476,187 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 1189.71 s
Avg key size: 17.48 bytes
Total cells: 437952134
Throughput: 368118.13 cells/s
Resends: 1483209
Why is the performance difference between the two methods so vast? What's the best way to improve mutator performance? Sample mutator code is below:
my $batch_size = 1000000; # or 1000 or 10000 make no substantial difference
my $ignore_unknown_cfs = 2;
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);
my $keys = new Hypertable::ThriftGen::Key({ row => $row, column_family => $cf, column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({key => $keys, value => $val});
$ht->mutator_set_cell($mutator, $cell);
$ht->mutator_flush($mutator);
I would appreciate any input on this. I don't have a tremendous amount of HyperTable experience.
Thanks.
If it's taking three days to load one million rows, then you're probably calling flush() after every row insert, which is not the right thing to do. Before I describe how to fix that, your mutator_open() arguments aren't quite right. You don't need to specify ignore_unknown_cfs and you should supply 0 for the flush_interval, something like this:
my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);
You should only call mutator_flush() if you would like to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all data that has been inserted on that mutator has durably made it into the database. If you're not checkpointing how much of the input data has been consumed, then there is no need to call mutator_flush(), since it will get flushed automatically when you close the mutator.
The next performance problem with your code that I see is that you're using mutator_set_cell(). You should use either mutator_set_cells() or mutator_set_cells_as_arrays() since each method call is a round-trip to the ThriftBroker, which is expensive. By using the mutator_set_cells_* methods, you amortize that round-trip over many cells. The mutator_set_cells_as_arrays() method can be more efficient for languages where object construction overhead is large in comparison to native datatypes (e.g. string). I'm not sure about Perl, but you might want to give that a try to see if it boosts performance.
Also, be sure to call mutator_close() when you're finished with the mutator.
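To make the round-trip argument concrete, here is a small Python sketch that only counts calls. CountingBroker is a hypothetical stand-in and not the Hypertable Thrift API; its set_cell and set_cells methods merely mimic the shape of a per-cell call versus a batched call, each costing one round trip.

```python
class CountingBroker:
    """Hypothetical stand-in for a remote broker; it only counts round trips."""

    def __init__(self):
        self.round_trips = 0
        self.cells = []

    def set_cell(self, cell):
        # One remote call per cell.
        self.round_trips += 1
        self.cells.append(cell)

    def set_cells(self, cells):
        # One remote call for a whole batch of cells.
        self.round_trips += 1
        self.cells.extend(cells)


def load_one_by_one(broker, cells):
    for cell in cells:
        broker.set_cell(cell)


def load_in_batches(broker, cells, batch_size):
    for start in range(0, len(cells), batch_size):
        broker.set_cells(cells[start:start + batch_size])


# Toy data: (row, column family, value) tuples invented for the example.
cells = [(f"row{i}", "cf", "value") for i in range(10_000)]
slow, fast = CountingBroker(), CountingBroker()
load_one_by_one(slow, cells)
load_in_batches(fast, cells, batch_size=1_000)
print(slow.round_trips, fast.round_trips)
```

Both brokers end up holding the same cells, but the batched loader makes one call per thousand cells instead of one per cell, which is the saving the mutator_set_cells_* methods are meant to provide.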
I have stock data at the tick level and would like to create a rolling list of all ticks for the previous 10 seconds. The code below works, but takes a very long time for large amounts of data. I'd like to vectorize this process or otherwise make it faster, but I'm not coming up with anything. Any suggestions or nudges in the right direction would be appreciated.
library(quantmod)
set.seed(150)
# Create five minutes of xts example data at .1 second intervals
mins <- 5
ticks <- mins * 60 * 10 + 1
times <- xts(runif(seq_len(ticks),1,100), order.by=seq(as.POSIXct("1973-03-17 09:00:00"),
as.POSIXct("1973-03-17 09:05:00"), length = ticks))
# Randomly remove some ticks to create unequal intervals
times <- times[runif(seq_along(times))>.3]
# Number of seconds to look back
lookback <- 10
dist.list <- list(rep(NA, nrow(times)))
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
> user system elapsed
6.12 0.00 5.85
You should check out the window function, it will make your subselection of dates a lot easier. The following code uses lapply to do the work of the for loop.
# Your code
system.time(
for (i in 1:length(times)) {
dist.list[[i]] <- times[paste(strptime(index(times[i])-(lookback-1), format = "%Y-%m-%d %H:%M:%S"), "/",
strptime(index(times[i])-1, format = "%Y-%m-%d %H:%M:%S"), sep = "")]
}
)
# user system elapsed
# 10.09 0.00 10.11
# My code
system.time(dist.list<-lapply(index(times),
function(x) window(times,start=x-lookback-1,end=x))
)
# user system elapsed
# 3.02 0.00 3.03
So, it runs in about a third of the time (roughly three times faster).
But, if you really want to speed things up, and you are willing to forgo millisecond accuracy (which I think your original method implicitly does), you could just run the loop on unique date-hour-second combinations, because they will all return the same time window. This should speed things up roughly twenty or thirty times:
dat.time=unique(as.POSIXct(as.character(index(times)))) # Cheesy method to drop the ms.
system.time(dist.list.2<-lapply(dat.time,function(x) window(times,start=x-lookback-1,end=x)))
# user system elapsed
# 0.37 0.00 0.39
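A different way to avoid rebuilding each window from scratch, sketched here in Python rather than R, is to keep the timestamps sorted and binary-search for the start of each window. This is not the window() approach above but a swapped-in technique; the function name, the toy timestamps (plain seconds) and the choice of a half-open window [t - lookback, t) that excludes the current tick are all assumptions.

```python
from bisect import bisect_left

def lookback_windows(timestamps, lookback):
    """For each tick i, return (lo, i) so that timestamps[lo:i] are the
    earlier ticks with t_i - lookback <= t < t_i. Input must be sorted."""
    windows = []
    for i, t in enumerate(timestamps):
        # Search only among ticks before i, so the current tick is excluded.
        lo = bisect_left(timestamps, t - lookback, 0, i)
        windows.append((lo, i))
    return windows

# Unevenly spaced tick times in seconds, invented for the example.
ticks = [0.0, 0.1, 0.4, 5.0, 10.2, 10.3, 21.0]
windows = lookback_windows(ticks, lookback=10)

for (lo, hi), t in zip(windows, ticks):
    print(t, ticks[lo:hi])
```

Each window is stored as a pair of indices rather than a copy of the data, so the cost per tick is one binary search, and the slices can be materialised only when they are actually needed.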