HyperTable: Loading data using Mutators Vs. LOAD DATA INFILE - performance

I am starting a discussion that I hope will become the one place to compare loading data with mutators vs. loading from a flat file via 'LOAD DATA INFILE'.
I have been baffled by how poor the performance is when using mutators (with batch sizes of 1000, 10000, 100K, et cetera).
My project involved loading close to 400 million rows of social media data into HyperTable to be used for real-time analytics. It took me close to 3 days to load just 1 million rows (code sample below). Each row is approximately 32 bytes. So, to avoid taking 2-3 weeks to load this much data, I prepared a flat file with the rows and used the LOAD DATA INFILE method. The performance gain was amazing: with this method, the loading rate was 368336 cells/sec.
See below for an actual snapshot of the load:
hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;
Loading 7,113,154,337 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 508.07 s
Avg key size: 8.92 bytes
Total cells: 218976067
Throughput: 430998.80 cells/s
Resends: 2210404
hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;
Loading 12,693,476,187 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 1189.71 s
Avg key size: 17.48 bytes
Total cells: 437952134
Throughput: 368118.13 cells/s
Resends: 1483209
Why is the performance difference between the two methods so vast? What's the best way to improve mutator performance? Sample mutator code is below:
use Hypertable::ThriftClient;

my $batch_size = 1000000; # 1000, 10000, etc. make no substantial difference
my $ignore_unknown_cfs = 2;

# $master, $port and $namespace are defined elsewhere in the script
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);

# For each input record ($row, $cf, $cq, $val come from the parsed input;
# $mutator is either $users_mutator or $graph_mutator):
my $keys = new Hypertable::ThriftGen::Key({ row => $row, column_family => $cf, column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({ key => $keys, value => $val });
$ht->mutator_set_cell($mutator, $cell);
$ht->mutator_flush($mutator);
I would appreciate any input on this; I don't have a tremendous amount of HyperTable experience.
Thanks.

If it's taking three days to load one million rows, then you're probably calling flush() after every row insert, which is not the right thing to do. Before I describe how to fix that, your mutator_open() arguments aren't quite right. You don't need to specify ignore_unknown_cfs and you should supply 0 for the flush_interval, something like this:
my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);
You should only call mutator_flush() if you would like to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all data that has been inserted on that mutator has durably made it into the database. If you're not checkpointing how much of the input data has been consumed, then there is no need to call mutator_flush(), since it will get flushed automatically when you close the mutator.
The next performance problem with your code that I see is that you're using mutator_set_cell(). You should use either mutator_set_cells() or mutator_set_cells_as_arrays() since each method call is a round-trip to the ThriftBroker, which is expensive. By using the mutator_set_cells_* methods, you amortize that round-trip over many cells. The mutator_set_cells_as_arrays() method can be more efficient for languages where object construction overhead is large in comparison to native datatypes (e.g. string). I'm not sure about Perl, but you might want to give that a try to see if it boosts performance.
Also, be sure to call mutator_close() when you're finished with the mutator.

Related

tmssoftware TTMSFNCGrid slow data loading

Delphi 10.4.2, TTMSFNCGrid ver. 1.0.5.16
I am downloading about 30,000 records from the database into a JSON object. This takes about 1 minute.
I then loop over the data and enter it into the TTMSFNCGrid, which ends up with about 30,000 rows and 16 columns. The data entry takes 20 minutes! That is how long it takes to populate and render the grid. How can I speed up this process?
I use something like this
for _i := 0 to JSON_ARRAY_DANE.Count-1 do
begin
  _row := JSON_ARRAY_DANE.Items[_i] as TJSONObject;
  _grid.Cells[0,_i+1] := _row.GetValue('c1').Value;
  _grid.Cells[1,_i+1] := _row.GetValue('c2').Value;
  _grid.Cells[2,_i+1] := _row.GetValue('c3').Value;
  .
  .
  _grid.Cells[16,_i+1] := _row.GetValue('c16').Value;
end;
Resolved. You need to wrap the loop with _grid.BeginUpdate and _grid.EndUpdate:
_grid.BeginUpdate;
for _i := 0 to JSON_ARRAY_DANE.Count-1 do
begin
  _row := JSON_ARRAY_DANE.Items[_i] as TJSONObject;
  _grid.Cells[0,_i+1] := _row.GetValue('c1').Value;
  _grid.Cells[1,_i+1] := _row.GetValue('c2').Value;
  _grid.Cells[2,_i+1] := _row.GetValue('c3').Value;
  .
  .
  _grid.Cells[16,_i+1] := _row.GetValue('c16').Value;
end;
_grid.EndUpdate;

Tibco Spotfire - time in seconds & milliseconds in Real, convert to a time of day

I have a list of time in a decimal format of seconds, and I know what time the series started. I would like to convert it to a time of day with the offset of the start time applied. There must be a simple way to do this that I am really missing!
Sample source data:
\Name of source file : 260521-11_58
\Recording from 26.05.2021 11:58
\Channels : 1
\Scan rate : 101 ms = 0.101 sec
\Variable 1: n1(rpm)
\Internal identifier: 63
\Information1:
\Information2:
\Information3:
\Information4:
0.00000 3722.35645
0.10100 3751.06445
0.20200 1868.33350
0.30300 1868.36487
0.40400 3722.39355
0.50500 3722.51831
0.60600 3722.50464
0.70700 3722.32446
0.80800 3722.34277
0.90900 3722.47729
1.01000 3722.74048
1.11100 3722.66650
1.21200 3722.39355
1.31300 3751.02710
1.41400 1868.27539
1.51500 3722.49097
1.61600 3750.93286
1.71700 1868.30334
1.81800 3722.29224
The start time & date is 26.05.2021 11:58, and the left-hand column is elapsed time in seconds, with the column name [Time]. So I just want to convert the decimal/real value to a time or timespan and add the start time to it.
I have tried lots of ways that are really hacky, and ultimately flawed - the below works, but just ignores the milliseconds.
TimeSpan(0,0,0,Integer(Floor([Time])),[Time] - Integer(Floor([Time])))
The last part works to just get milli / micro seconds on its own, but not as part of the above.
Your formula isn't really ignoring the milliseconds; you are passing the decimal part of your time (in seconds) directly as milliseconds, so the value being returned is smaller than the format mask shows.
You need to convert the seconds to milliseconds, so something like this should work
TimeSpan(0,0,0,Integer(Floor([Time])),([Time] - Integer(Floor([Time]))) * 1000)
To add it to the time, this would work
DateAdd(Date("26-May-2021"),TimeSpan(0,0,0,Integer([Time]),([Time] - Integer([Time])) * 1000))
You will need to set the column format to
dd-MMM-yyyy HH:mm:ss:fff

Is it possible to vectorize annotation for matplotlib?

As part of a large QC benchmark I am creating a large number (approx 100K) of scatter plots in a single PDF using the PdfPages backend. (See further down for the code.)
The issue I am having is that the plotting takes too much time; see the output from a custom profiling/debugging effort:
Checkpoint1: Predictions done in 1.110076904296875 millis
Checkpoint2: df created and correlations calculated in 3.108978271484375 millis
Checkpoint3: plotting and accumulating done in 231.31990432739258 millis
Cycle completed in 0.23553895950317383 secs
----------------------
Checkpoint1: Predictions done in 3.718852996826172 millis
Checkpoint2: df created and correlations calculated in 2.353191375732422 millis
Checkpoint3: plotting and accumulating done in 155.93385696411133 millis
Cycle completed in 0.16200590133666992 secs
----------------------
Checkpoint1: Predictions done in 2.920866012573242 millis
Checkpoint2: df created and correlations calculated in 1.995086669921875 millis
Checkpoint3: plotting and accumulating done in 161.8819236755371 millis
Cycle completed in 0.16679787635803223 secs
The time for plotting increases 2-3x if I annotate the points, which is necessary for the use case. As you can see below, I have tried both itertuples() and apply(); switching to apply() did not give a significant change in the timings as far as I can see.
import matplotlib.pyplot as plt


def annotate(row, ax):
    ax.annotate(row.name, (row.exp, row.model),
                xytext=(10, 20), textcoords='offset points',
                arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
                family='sans-serif', fontsize=8, color='darkslategrey')


def plot2File(df, file, seq, z, p, s):
    """ Plot predictions vs experimental """
    plttitle = f"Correlations for {seq}+{z} \n pearson={p} \n spearman={s}"
    ax = df.plot(x='exp', y='model', kind='scatter', title=plttitle, s=40)
    df.apply(annotate, ax=ax, axis=1)
    # for row in df.itertuples():
    #     ax.annotate(row.Index, (row.exp, row.model),
    #                 xytext=(10, 20), textcoords='offset points',
    #                 arrowprops=dict(arrowstyle="-", connectionstyle="arc,angleA=180,armA=10"),
    #                 family='sans-serif', fontsize=8, color='darkslategrey')
    plt.savefig(file, bbox_inches='tight', format='pdf')
    plt.close()
Given the nice explanation by Jeff on a question regarding iterrows(), I was wondering whether it would be possible to vectorize the annotation process, or whether I should ditch using a data frame altogether?

Improve PostgreSQL insert performance when compared to Oracle. Low memory utilization by PostgreSQL threads

I am trying to improve the performance of PostgreSQL (version 9.2.3) inserts for a simple table with 1 bigint, 1 varchar, 1 float and 2 timestamps.
A simple replication of my JDBC program is attached. Here are the important points I want to mention:
I am running this program on the same system which hosts the PostgreSQL DB. (64 GB RAM and 8 CPUs.)
I am using INSERT statements and I DO NOT want to use the COPY statement. I have read and understand that COPY performs better, but I am tuning INSERT performance here.
I am using PreparedStatement.addBatch() and executeBatch() to insert in batches of 1000s.
The performance of the insert scales well when I increase the batch size but flattens out at around a batch size of 8000. What I notice is that the postgresql thread on the system is CPU saturated, as observed with the "top" command. The CPU usage of the postgres thread steadily increases and tops out at 95% when the batch size reaches 8K. The other interesting thing I notice is that it is using only up to 200MB of RAM per thread.
In comparison, an Oracle DB scales much better: the same number of inserts with comparable batch sizes finishes 3 to 4 times faster. I logged on to the Oracle DB machine (a Sun Solaris machine) and noticed that the CPU utilization peaks at a much bigger batch size, and also that each Oracle thread is using 6 to 8 GB of memory.
Given that I have memory to spare is there a way to increase the memory usage for a postgres thread for better performance?
Here are my current postgresql settings:
temp_buffers = 256MB
bgwriter_delay = 100ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4
maintenance_work_mem = 2GB
shared_buffers = 8GB
vacuum_cost_limit = 800
work_mem = 2GB
max_connections = 100
checkpoint_completion_target = 0.9
checkpoint_segments = 32
checkpoint_timeout =10min
checkpoint_warning =1min
wal_buffers = 32MB
wal_level = archive
cpu_tuple_cost = 0.03
effective_cache_size = 48GB
random_page_cost = 2
autovacuum = on
autovacuum_vacuum_cost_delay = 10ms
autovacuum_max_workers = 6
autovacuum_naptime = 5
autovacuum_vacuum_threshold = 100
autovacuum_analyze_threshold = 100
autovacuum_vacuum_scale_factor = 0.2
autovacuum_analyze_scale_factor = 0.1
autovacuum_vacuum_cost_limit = -1
Here are the measurements:
Time to insert 2 million rows in PostgreSQL.
batch size - Execute batch time (sec)
1K - 73
2K - 64
4K - 60
8K - 59
10K - 59
20K - 59
40K - 59
Time to insert 4 million rows in Oracle.
batch size - Execute batch time (sec)
1K - 14
2K - 12
4K - 10
8K - 8.9
10K - 8.4
As you can see, Oracle inserts 4 million rows much faster than PostgreSQL inserts 2 million.
Here is the snippet of the program I am using for insertion.
stmt.executeUpdate("CREATE TABLE "
    + tableName
    + " (P_PARTKEY bigint not null, "
    + " P_NAME varchar(55) not null, "
    + " P_RETAILPRICE float not null, "
    + " P_TIMESTAMP Timestamp not null, "
    + " P_TS2 Timestamp not null)");
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO " + tableName + " VALUES (?, ?, ?, ?, ? )");
for (int i = start; i <= end; i++) {
    pstmt.setInt(1, i);
    pstmt.setString(2, "Magic Maker " + i);
    pstmt.setFloat(3, i);
    pstmt.setTimestamp(4, new Timestamp(1273017600000L));
    pstmt.setTimestamp(5, new Timestamp(1273017600000L));
    pstmt.addBatch();
    if (i % batchSize == 0) {
        pstmt.executeBatch();
    }
}
autovacuum_analyze_scale_factor = 0.002
autovacuum_vacuum_scale_factor = 0.001
You might need to change the above parameters.
autovacuum_analyze_scale_factor specifies a fraction of the table size to add to autovacuum_analyze_threshold when deciding whether to trigger an ANALYZE. The default is 0.1 (10% of table size); in our case we have lowered that to 0.002 to make it more aggressive.
autovacuum_vacuum_scale_factor specifies a fraction of the table size to add to autovacuum_vacuum_threshold when deciding whether to trigger a VACUUM. The default is 0.2 (20% of table size). For example, with autovacuum_vacuum_threshold = 100 and the default factor of 0.2, a 2-million-row table is only vacuumed once roughly 100 + 0.2 * 2,000,000 = 400,100 tuples have changed; lowering the factor to 0.001 drops that trigger point to about 2,100.
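One client-side tweak that is not mentioned in the original post, but is commonly combined with addBatch()/executeBatch() when tuning INSERT throughput, is to run the whole load inside a single explicit transaction instead of letting every executeBatch() pay for its own commit. Below is a minimal, hypothetical sketch of the loader with that change; the JDBC URL, credentials, table name, row count and batch size are placeholders, and the extra executeBatch() after the loop flushes any final partial batch.
import java.sql.*;

public class BatchInsertSketch {
    public static void main(String[] args) throws SQLException {
        int batchSize = 8000;
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "postgres", "postgres")) {
            conn.setAutoCommit(false);                       // defer commits: one transaction for the whole load
            try (PreparedStatement pstmt = conn.prepareStatement(
                    "INSERT INTO part_test VALUES (?, ?, ?, ?, ?)")) {
                for (int i = 1; i <= 2000000; i++) {
                    pstmt.setLong(1, i);                     // P_PARTKEY is bigint, so setLong rather than setInt
                    pstmt.setString(2, "Magic Maker " + i);
                    pstmt.setFloat(3, i);
                    pstmt.setTimestamp(4, new Timestamp(1273017600000L));
                    pstmt.setTimestamp(5, new Timestamp(1273017600000L));
                    pstmt.addBatch();
                    if (i % batchSize == 0) {
                        pstmt.executeBatch();                // send the batch; the transaction stays open
                    }
                }
                pstmt.executeBatch();                        // flush any remaining rows in a partial batch
            }
            conn.commit();                                   // single commit at the end of the load
        }
    }
}
How much this helps will depend on settings such as synchronous_commit and on the disk behind the WAL, but it removes one commit per executeBatch() call.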

Ehcache miss counts

These are the statistics for my Ehcache.
I have it configured to use only memory (no persistence, no overflow to disk).
cacheHits = 50
onDiskHits = 0
offHeapHits = 0
inMemoryHits = 50
misses = 1194
onDiskMisses = 0
offHeapMisses = 0
inMemoryMisses = 1138
size = 69
averageGetTime = 0.061597
evictionCount = 0
As you can see, misses is higher than onDiskMisses + offHeapMisses + inMemoryMisses. I do have statistics strategy set to best effort:
cache.setStatisticsAccuracy(Statistics.STATISTICS_ACCURACY_BEST_EFFORT)
But the hits add up, and the discrepancy in the miss counts is rather large. Is there a reason why the misses do not add up correctly?
This question is similar to Ehcache misses count and hitrate statistics, but the answer there attributes the differences to the multiple tiers. There is only one tier here.
I'm almost certain that you're seeing this because inMemoryMisses does not include misses due to expired elements. On a get, if the value is stored but expired, you will not see an inMemoryMiss recorded, but you will see a cache miss.
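As a quick way to see the expiry behaviour described above, here is a small, hypothetical sketch against the Ehcache 2.x API (the cache name, key and TTL are made up): the element is still sitting in the memory store when its TTL passes, yet get() returns null, and per the explanation above that shows up in the overall miss count rather than as an inMemoryMiss.
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class ExpiredMissSketch {
    public static void main(String[] args) throws InterruptedException {
        CacheManager manager = CacheManager.create();
        // name, maxElementsInMemory, overflowToDisk, eternal, timeToLiveSeconds, timeToIdleSeconds
        Cache cache = new Cache("demo", 100, false, false, 1, 0);
        manager.addCache(cache);

        cache.put(new Element("key", "value"));
        System.out.println(cache.get("key"));   // fresh element -> in-memory hit

        Thread.sleep(1500);                      // let the 1-second TTL elapse
        System.out.println(cache.get("key"));   // prints null: expired on access, recorded as a miss

        manager.shutdown();
    }
}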
