Performance: how to insert CLOB fast using cx_Oracle and executemany()?

cx_Oracle API was very fast for me until I tried to work with CLOB values.
I do it as follows:
import time
import cx_Oracle
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.prepare("insert into table_clob (msg_id, message) values (:msg_id, :msg)")
cur.bindarraysize = num_records
msg_arr = cur.var(cx_Oracle.CLOB, arraysize=num_records)
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
for id in range(num_records):
    msg_arr.setvalue(id, text)
    rows.append( (id, msg_arr) ) # ???
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
cur.executemany(None, rows)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
The main problem worrying me is performance:
100 records prepared, 25.090 s - far too long for copying 100 MB in memory!
100 records inserted, 23.503 s - seems too long for sending 100 MB over the network.
The problematic step is msg_arr.setvalue(id, text). If I comment it out, the script takes just milliseconds to complete (inserting null into the CLOB column, of course).
Secondly, it seems weird to add the same reference to the CLOB variable to every tuple in the rows list. I found this example on the internet, and it works correctly, but am I doing it right?
Are there ways to improve performance in my case?
UPDATE: Tested network throughput: a 107 MB file copies in 11 s via SMB to the same host. But again, network transfer is not the main problem. Data preparation takes an abnormally long time.

A weird workaround (thanks to Avinash Nandakumar from the cx_Oracle mailing list), but it is a real way to greatly improve performance when inserting CLOBs:
import time
import cx_Oracle
import sys
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.bindarraysize = num_records
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
cur.executemany(
    "insert into table_clob (msg_id, message) values (:msg_id, empty_clob())",
    [(i,) for i in range(1, 101)])
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
# Parentheses let the statement span two lines; a trailing "+" alone is a syntax error.
selstmt = ("select message from table_clob "
           "where msg_id between 1 and :1 for update")
cur.execute(selstmt, [num_records])
for id in range(num_records):
    results = cur.fetchone()
    results[0].write(text)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
Semantically this is not exactly the same as in my original post; I wanted to keep the example as simple as possible to show the principle. The point is that you should insert empty_clob(), then select the row and write the CLOB's contents.

Related

Improve code result speed by multiprocessing

I'm teaching myself Python and this is my first code.
I analyze logs from servers, and usually I need to analyze a full day of logs. I created a script (this example uses simplified logic) just to check the speed. With ordinary sequential code, analyzing 20 million rows takes about 12-13 minutes. I need to process 200 million rows in 5 minutes.
What I tried:
Multiprocessing (I ran into an issue with shared memory, which I think I fixed). But the result is 300K rows in 20 seconds, no matter how many processes I use. (PS: I also need to control the number of processes in advance.)
Threading (I found that it gives no speedup: 300K rows in 2 seconds, the same as the ordinary code.)
asyncio (I thought the script was slow because it needs to read many files). The result is the same as with threading: 300K rows in 2 seconds.
In the end I think all three of my scripts are incorrect and don't work properly.
PS: I try to avoid third-party Python modules (like pandas) because they would make the script harder to run on different servers. I'd rather use the standard library.
Please help me check the first one, multiprocessing.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager
file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}
def argument(m, a, n):
    proc_num = os.getpid()
    a_temp_m = a["vod_miss"]
    a_temp_h = a["vod_hit"]
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m[n] = a_temp_m[n] + 1
            elif j[3].find('HIT') != -1:
                a_temp_h[n] = a_temp_h[n] + 1
    a["vod_miss"][n] = a_temp_m[n]
    a["vod_hit"][n] = a_temp_h[n]

if __name__ == '__main__':
    procs = []
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    n = 1
    vod_live_cuts[i] = manager.list([0] * cpu)
    vod_live_cuts[ii] = manager.list([0] * cpu)
    for m in file:
        proc = Process(target=argument, args=(m, vod_live_cuts, (n-1)))
        procs.append(proc)
        proc.start()
        if n >= cpu:
            n = 1
            proc.join()
        else:
            n += 1
    [proc.join() for proc in procs]
    [proc.close() for proc in procs]
I expect each file to be processed by the function argument in an independent process, with all results finally saved in the dict vod_live_cuts. For each process I added an independent list entry in the dict. I thought this would help the processes share this parameter safely. But maybe it's the wrong way :(

Using IPC is costly, so use "shared objects" only for saving the final result, not for intermediate results while parsing the file.
Limiting the number of processes is done with a multiprocessing.Pool. The following code uses one to reach the maximum hard-disk speed; you only need to post-process the results.
You can only parse data as fast as your HDD can read it (typically 30-80 MB/s), so to improve performance further you should use an SSD or RAID 0 for higher disk speed. You cannot get much faster than this without changing your hardware.
import csv
import os
from multiprocessing import Process, Queue, Value, Manager, Pool
file = {"hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"}
def argument(m, a):
    proc_num = os.getpid()
    a_temp_m_n = 0 # make it local to process
    a_temp_h_n = 0 # as shared lists use IPC
    with open(os.getcwd() + '/' + m, newline='') as hcs_1:
        hcs_2 = csv.reader(hcs_1, delimiter=' ')
        for j in hcs_2:
            if j[3].find('MISS') != -1:
                a_temp_m_n = a_temp_m_n + 1
            elif j[3].find('HIT') != -1:
                a_temp_h_n = a_temp_h_n + 1
    a["vod_miss"].append(a_temp_m_n)
    a["vod_hit"].append(a_temp_h_n)

if __name__ == '__main__':
    manager = Manager()
    vod_live_cuts = manager.dict()
    i = "vod_hit"
    ii = "vod_miss"
    cpu = 1
    vod_live_cuts[i] = manager.list()
    vod_live_cuts[ii] = manager.list()
    with Pool(cpu) as pool:
        tasks = []
        for m in file:
            task = pool.apply_async(argument, args=(m, vod_live_cuts))
            tasks.append(task)
        for task in tasks:
            task.get()
    print(list(vod_live_cuts[i]))
    print(list(vod_live_cuts[ii]))
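The advice to keep intermediate counts local can be taken one step further. The following is only a sketch, not the answer's code: each worker returns its own counts and the parent sums them, so no Manager is involved at all. The names count_hits and count_all are made up for this illustration, and skipping lines with fewer than four fields is an assumption about the log format.

```python
import csv
from multiprocessing import Pool


def count_hits(path):
    """Count MISS and HIT rows in one space-delimited log file."""
    miss = hit = 0
    with open(path, newline='') as log_file:
        for row in csv.reader(log_file, delimiter=' '):
            if len(row) < 4:
                continue  # assumption: malformed or short lines are ignored
            if 'MISS' in row[3]:
                miss += 1
            elif 'HIT' in row[3]:
                hit += 1
    return miss, hit


def count_all(paths, workers=2):
    """Parse the files in a pool of `workers` processes and sum the counts."""
    with Pool(workers) as pool:
        per_file = pool.map(count_hits, paths)
    total_miss = sum(m for m, _ in per_file)
    total_hit = sum(h for _, h in per_file)
    return total_miss, total_hit
```

It would be called as count_all(["hcs.log", "hcs1.log", "hcs2.log", "hcs3.log"], workers=4) from inside an if __name__ == '__main__': block. Only two integers per file cross the process boundary, so the IPC cost stays negligible.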

Running Out of RAM using FilePerUserClientData

I have a problem with training using tff.simulation.FilePerUserClientData - I am quickly running out of RAM after 5-6 rounds with 10 clients per round.
The RAM usage is steadily increasing with each round.
I tried to narrow it down and realized that the issue is not the actual iterative process but the creation of the client datasets.
Simply calling create_tf_dataset_for_client(client) in a loop causes the problem.
So this is a minimal version of my code:
import tensorflow as tf
import tensorflow_federated as tff
import numpy as np
import pickle
BATCH_SIZE = 16
EPOCHS = 2
MAX_SEQUENCE_LEN = 20
NUM_ROUNDS = 100
CLIENTS_PER_ROUND = 10
def decode_fn(record_bytes):
    return tf.io.parse_single_example(
        record_bytes,
        {"x": tf.io.FixedLenFeature([MAX_SEQUENCE_LEN], dtype=tf.string),
         "y": tf.io.FixedLenFeature([MAX_SEQUENCE_LEN], dtype=tf.string)}
    )

def dataset_fn(path):
    return tf.data.TFRecordDataset([path]).map(decode_fn).padded_batch(BATCH_SIZE).repeat(EPOCHS)

def sample_client_data(data, client_ids, sampling_prob):
    clients_total = len(client_ids)
    x = np.random.uniform(size=clients_total)
    sampled_ids = [client_ids[i] for i in range(clients_total) if x[i] < sampling_prob]
    data = [train_data.create_tf_dataset_for_client(client) for client in sampled_ids]
    return data

with open('users.pkl', 'rb') as f:
    users = pickle.load(f)
train_client_ids = users["train"]
client_id_to_train_file = {i: "reddit_leaf_tf/" + i for i in train_client_ids}
train_data = tff.simulation.datasets.FilePerUserClientData(
    client_ids_to_files=client_id_to_train_file,
    dataset_fn=dataset_fn
)
sampling_prob = CLIENTS_PER_ROUND / len(train_client_ids)
for round_num in range(0, NUM_ROUNDS):
    print('Round {r}'.format(r=round_num))
    participants_data = sample_client_data(train_data, train_client_ids, sampling_prob)
    print("Round Completed")
I am using tensorflow-federated 19.0.
Is there something wrong with the way I create the client datasets or is it somehow expected that the RAM from the previous round is not freed?

@schmana noticed this occurs when changing the cardinality of the CLIENTS placement (a different number of client datasets) each round. This results in a cache filling up, as documented in http://github.com/tensorflow/federated/issues/1215.
A workaround in the immediate term would be to call
tff.framework.get_context_stack().current.executor_factory.clean_up_executors()
at the start or end of every round.

How to solve for pyodbc.ProgrammingError: The second parameter to executemany must not be empty

Hi, I'm having an issue with transferring data from one database to another. I created a list from a field in a table on an MS SQL DB, used that list to query an Oracle DB table (using the initial list in the WHERE clause to filter results), and then load the query results back into the MS SQL DB.
The program runs for the first few iterations but then errors out with the following error:
Traceback (most recent call last):
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 67, in <module>
    insertIntoMSDatabase(idString)
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 48, in insertIntoMSDatabase
    mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
I can't seem to find any guidance online to troubleshoot this error message. I feel it may be a simple solution but I just can't get there...
# import libraries
import cx_Oracle
import pyodbc
import logging
import time
import re
import math
import numpy as np
logging.basicConfig(level=logging.DEBUG)
conn = pyodbc.connect('''Driver={SQL Server Native Client 11.0};
Server='servername';
Database='dbname';
Trusted_connection=yes;''')
b = conn.cursor()
dsn_tns = cx_Oracle.makedsn('Hostname', 'port', service_name='name')
conn1 = cx_Oracle.connect(user=r'uid', password='pwd', dsn=dsn_tns)
c = conn1.cursor()
beginTime = time.time()
bind = (b.execute('''select distinct field1
from [server].[db].[dbo].[table]'''))
print('MSQL table(s) queried, List Generated')
# formats ids for sql string
def surroundWithQuotes(id):
    # raw string so that \s is passed to the regex engine unchanged
    return "'" + re.sub(r",|\s$", "", str(id)) + "'"

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
    FROM Database.Table
    WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    print('Oracle table(s) queried, Data Pulled')
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename]
    (
    [fields1]
    ,[field2]
    )
    VALUES (?,?)'''
    val = claimsdata
    mycursor.executemany(sql, val)
    conn.commit()

ids = []
formattedIdStrings = []
# adds all the ids found in bind to an iterable array
for row in bind:
    ids.append(row[0])
# splits the ids[] array into multiple arrays < 1000 in length
batchedIds = np.array_split(ids, math.ceil(len(ids) / 1000))
# formats the value inside each batchedId to be a string
for batchedId in batchedIds:
    formattedIdStrings.append(",".join(map(surroundWithQuotes, batchedId)))
# runs insert into MS database for each batch of IDs
for idString in formattedIdStrings:
    insertIntoMSDatabase(idString)
print("MSQL table loaded, Data inserted into destination")
endTime = time.time()
print("Program Time Elapsed: ",endTime-beginTime)
conn.close()
conn1.close()

mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
Before calling .executemany() you need to verify that val is not an empty list (as would be the case if .fetchall() is called on a SELECT statement that returns no rows), e.g.,
if val:
    mycursor.executemany(sql, val)
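The same guard can be wrapped in a small helper so that every batch goes through it. This is a sketch only: insert_rows is a hypothetical name, and the standard-library sqlite3 module stands in for pyodbc here purely so the example runs without a SQL Server instance. sqlite3 itself does not raise on an empty list, so the point is the guard, not the driver's behaviour.

```python
import sqlite3


def insert_rows(cursor, sql, rows):
    """Run executemany only when there are rows; return how many were sent."""
    rows = list(rows)
    if not rows:
        return 0  # nothing was fetched for this batch, so skip the insert
    cursor.executemany(sql, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the pyodbc connection
cur = conn.cursor()
cur.execute("CREATE TABLE tablename (field1 TEXT, field2 TEXT)")
sql = "INSERT INTO tablename (field1, field2) VALUES (?, ?)"

inserted_empty = insert_rows(cur, sql, [])                      # skipped
inserted_two = insert_rows(cur, sql, [("a", "1"), ("b", "2")])  # executed
conn.commit()
total = cur.execute("SELECT COUNT(*) FROM tablename").fetchone()[0]
```

Returning the number of rows sent also makes it easy to log which batches of IDs matched nothing on the Oracle side.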

Cassandra slow performance on AWS

One of our DBAs has benchmarked Cassandra against Oracle on AWS EC2 for INSERT performance (1M records) using the same Python code (below), and got the following surprising results:
Oracle 12.2, Single node, 64cores/256GB, EC2 EBS storage, 38 sec
Cassandra 5.1.13 (DDAC), Single node, 2cores/4GB, EC2 EBS storage, 464 sec
Cassandra 3.11.4, Four nodes, 16cores/64GB(each node), EC2 EBS Storage, 486 sec
So, what are we doing wrong?
Why is Cassandra performing so slowly?
* Not enough nodes? (How come the four-node cluster is slower than the single node?)
* Configuration issues?
* Something else?
Thanks!
Following is the Python code:
import logging
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider
class PythonCassandraExample:
    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None
    def __del__(self):
        self.cluster.shutdown()
    def createsession(self):
        auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
        self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
        self.session = self.cluster.connect(self.keyspace)
    def getsession(self):
        return self.session
    # How about Adding some log info to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log
    # Create Keyspace based on Given Name
    def createkeyspace(self, keyspace):
        """
        :param keyspace: The Name of Keyspace to be created
        :return:
        """
        # Before we create new lets check if exiting keyspace; we will drop that and create new
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)
        self.log.info("creating keyspace...")
        self.session.execute("""
            CREATE KEYSPACE %s
            WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
            """ % keyspace)
        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)
    def create_table(self):
        c_sql = """
            CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
            ename varchar,
            sal double,
            city varchar);
            """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")
    # lets do some batch insert
    def insert_data(self):
        i = 1
        while i < 1000000:
            insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
            batch = BatchStatement()
            batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
            self.session.execute(batch)
            # self.log.info('Batch Insert Completed for ' + str(i))
            i += 1
    # def select_data(self):
    #     rows = self.session.execute('select count(*) from perftest.employee limit 5;')
    #     for row in rows:
    #         print(row.ename, row.sal)
    def update_data(self):
        pass
    def delete_data(self):
        pass

if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('perftest')
    example1.create_table()
    # Populate perftest.employee table
    start = time.time()
    example1.insert_data()
    end = time.time()
    print ('Duration: ' + str(end-start) + ' sec.')
    # example1.select_data()

There are multiple issues here:
For the 2nd test you didn't allocate enough memory and cores for DDAC, so Cassandra got only a 1 GB heap: by default Cassandra takes 1/4 of all available memory. The same applies to the 3rd test: it will get only 16 GB of RAM for the heap, and you may need to bump it to a higher value, such as 24 GB or even more.
It's not clear how many IOPS you have in each test. EBS throughput differs depending on the size of the volume and its type.
You're using the synchronous API to execute commands: basically, you insert the next item only after you get confirmation that the previous one was inserted. The best throughput can be achieved by using the asynchronous API.
You're preparing your statement in every iteration. This sends the CQL string to the server each time, which slows everything down. Just move the line insert_sql = self.session.prepare( out of the loop.
(Not completely related) You're using batch statements to write data. This is an anti-pattern in Cassandra, as the data is sent to only one node, which then has to distribute it to the nodes that really own the data. This explains why the 4-node cluster is worse than the 1-node cluster.
P.S. Realistic load testing is quite hard. There are specialized tools for this; you can find more information, for example, in this blog post.
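Two of the points above (prepare once, and use the asynchronous API instead of batches) could look roughly like the following. This is a sketch under assumptions, not the answerer's code: employee_rows and insert_concurrently are hypothetical names, the concurrency value of 100 is arbitrary, and session is assumed to be an already connected session with the keyspace set. The helper execute_concurrent_with_args comes from the Python driver's cassandra.concurrent module.

```python
def employee_rows(count):
    """Yield (emp_id, ename, sal, city) tuples matching the employee table."""
    for emp_id in range(1, count + 1):
        yield (emp_id, 'Danny', 2555, 'De-vito')


def insert_concurrently(session, count, concurrency=100):
    """Insert `count` rows with one prepared statement and no batches."""
    # Imported here so that employee_rows can be used without the driver installed.
    from cassandra.concurrent import execute_concurrent_with_args

    # Prepare once, outside any loop.
    insert_sql = session.prepare(
        "INSERT INTO employee (emp_id, ename, sal, city) VALUES (?, ?, ?, ?)")
    # The driver keeps up to `concurrency` requests in flight at a time.
    results = execute_concurrent_with_args(
        session, insert_sql, employee_rows(count), concurrency=concurrency)
    # Each result is a (success, result_or_exception) pair.
    return sum(1 for success, _ in results if success)
```

With the default settings the call raises on the first failed insert, so the returned count equals count when it completes normally. For a million rows the result list itself takes some memory; splitting the work into smaller chunks is one way around that.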

The updated code below will batch every 100 records:
"""
Python by Techfossguru
Copyright (C) 2017 Satish Prasad
"""
import logging
import warnings
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider
class PythonCassandraExample:
    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None
    def __del__(self):
        self.cluster.shutdown()
    def createsession(self):
        auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
        self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
        self.session = self.cluster.connect(self.keyspace)
    def getsession(self):
        return self.session
    # How about Adding some log info to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log
    # Create Keyspace based on Given Name
    def createkeyspace(self, keyspace):
        """
        :param keyspace: The Name of Keyspace to be created
        :return:
        """
        # Before we create new lets check if exiting keyspace; we will drop that and create new
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)
        self.log.info("creating keyspace...")
        self.session.execute("""
            CREATE KEYSPACE %s
            WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
            """ % keyspace)
        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)
    def create_table(self):
        c_sql = """
            CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
            ename varchar,
            sal double,
            city varchar);
            """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")
    # lets do some batch insert
    def insert_data(self):
        i = 1
        insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
        batch = BatchStatement()
        warnings.filterwarnings("ignore", category=FutureWarning)
        while i < 1000001:
            # insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
            # batch = BatchStatement()
            batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
            # Commit every 100 records
            if (i % 100 == 0):
                self.session.execute(batch)
                batch = BatchStatement()
            # self.log.info('Batch Insert Completed for ' + str(i))
            i += 1
        self.session.execute(batch)
    # def select_data(self):
    #     rows = self.session.execute('select count(*) from actimize.employee limit 5;')
    #     for row in rows:
    #         print(row.ename, row.sal)
    def update_data(self):
        pass
    def delete_data(self):
        pass

if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('actimize')
    example1.create_table()
    # Populate actimize.employee table
    start = time.time()
    example1.insert_data()
    end = time.time()
    print ('Duration: ' + str(end-start) + ' sec.')
    # example1.select_data()

How to speed up the addition of a new column in pandas, based on comparisons on an existing one

I am working on a large-ish dataframe collection with some machine data in several tables. The goal is to add a column to every table which expresses the row's "class", considering its vicinity to a certain time stamp.
seconds = 1800
for i in range(len(tables)): # looping over 20 equally structured tables containing machine data
    table = tables[i]
    table['Class'] = 'no event'
    for event in events[i].values: # looping over 20 equally structured tables containing events
        event_time = event[1] # get integer time stamp
        start_time = event_time - seconds
        table.loc[(table.Time<=event_time) & (table.Time>=start_time), 'Class'] = 'event soon'
The event_times and the entries in table.Time are integers. The point is to assign the class "event soon" to all rows in a specific time frame before an event (the number of seconds).
The code takes quite long to run, and I am not sure what is to blame and what can be fixed. The number of seconds does not have much impact on the runtime, so the part where the table is actually changed is probably working fine, and the slowness may have to do with the nested loops instead. However, I don't see how to get rid of them. Hopefully, there is a faster, more pandas-like way to add this class column.
I am working with Python 3.6 and Pandas 0.19.2

You can use numpy broadcasting to do this vectorised instead of looping.
Dummy data generation
import numpy as np
import pandas as pd

num_tables = 5
seconds = 1800

def gen_table(count):
    for i in range(count):
        times = [(100 + j)**2 for j in range(i, 50 + i)]
        df = pd.DataFrame(data={'Time': times})
        yield df

def gen_events(count, num_tables):
    for i in range(num_tables):
        times = [1E4 + 100 * (i + j )**2 for j in range(count)]
        yield pd.DataFrame(data={'events': times})

tables = list(gen_table(num_tables)) # a list of 5 DataFrames of length 50
events = list(gen_events(5, num_tables)) # a list of 5 DataFrames of length 5
Comparison
For debugging, I added a dict of verification DataFrames. They are not needed, I just used them for debugging
verification = {}
for i, (table, event_df) in enumerate(zip(tables, events)):
    event_list = event_df['events']
    time_diff = event_list.values - table['Time'].values[:,np.newaxis] # This is where the magic happens
    events_close = np.any( (0 < time_diff) & (time_diff < seconds), axis=1)
    table['Class'] = np.where(events_close, 'event soon', 'no event')
    # The stuff after this line can be deleted since it's only used for the verification
    df = pd.DataFrame(data=time_diff, index=table['Time'], columns=event_list)
    df['event'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    verification[i] = df
newaxis
A good explanation on broadcasting is in Jakevdp's book
table['Time'].values[:,np.newaxis]
gives a (50,1) 2-d array
array([[10000],
       [10201],
       [10404],
       ....
       [21609],
       [21904],
       [22201]], dtype=int64)
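To make the broadcasting step concrete, here is a small stand-alone sketch using three of the time stamps and three of the event times from the dummy data. The variable names times and event_times are chosen for this illustration only.

```python
import numpy as np

seconds = 1800
# Three sample rows and three sample events taken from the dummy data.
times = np.array([10000, 10201, 11664])
event_times = np.array([10000.0, 10100.0, 11600.0])

# A (3, 1) column subtracted from a (3,) row broadcasts to a (3, 3) matrix:
# time_diff[r, e] = event_times[e] - times[r]
time_diff = event_times - times[:, np.newaxis]

# A row is flagged if any event lies strictly within the next `seconds`.
events_close = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
labels = np.where(events_close, 'event soon', 'no event')
```

The first two rows have an event less than 1800 seconds ahead, while 11664 lies after the last event, so only the first two are labelled 'event soon'. Note that the intermediate matrix has one entry per (row, event) pair, so memory use grows with the product of the two lengths.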
Verification
For the first step the verification df looks like this:
events   10000.0   10100.0   10400.0   10900.0   11600.0  event
Time
10000        0.0     100.0     400.0     900.0    1600.0   True
10201     -201.0    -101.0     199.0     699.0    1399.0   True
10404     -404.0    -304.0      -4.0     496.0    1196.0   True
10609     -609.0    -509.0    -209.0     291.0     991.0   True
10816     -816.0    -716.0    -416.0      84.0     784.0   True
11025    -1025.0    -925.0    -625.0    -125.0     575.0   True
11236    -1236.0   -1136.0    -836.0    -336.0     364.0   True
11449    -1449.0   -1349.0   -1049.0    -549.0     151.0   True
11664    -1664.0   -1564.0   -1264.0    -764.0     -64.0  False
11881    -1881.0   -1781.0   -1481.0    -981.0    -281.0  False
12100    -2100.0   -2000.0   -1700.0   -1200.0    -500.0  False
12321    -2321.0   -2221.0   -1921.0   -1421.0    -721.0  False
12544    -2544.0   -2444.0   -2144.0   -1644.0    -944.0  False
....
20449   -10449.0  -10349.0  -10049.0   -9549.0   -8849.0  False
20736   -10736.0  -10636.0  -10336.0   -9836.0   -9136.0  False
21025   -11025.0  -10925.0  -10625.0  -10125.0   -9425.0  False
21316   -11316.0  -11216.0  -10916.0  -10416.0   -9716.0  False
21609   -11609.0  -11509.0  -11209.0  -10709.0  -10009.0  False
21904   -11904.0  -11804.0  -11504.0  -11004.0  -10304.0  False
22201   -12201.0  -12101.0  -11801.0  -11301.0  -10601.0  False

Small optimizations of the original answer.
You can shave a few lines and some assignments off the original algorithm:
for table, event_df in zip(tables, events):
    table['Class'] = 'no event'
    for event_time in event_df['events']: # looping over the event time stamps for this table
        start_time = event_time - seconds
        table.loc[table['Time'].between(start_time, event_time), 'Class'] = 'event soon'
You might shave off some more if, instead of the text 'no event' and 'event soon', you just used booleans.
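The boolean variant could be sketched as follows. The function name flag_event_soon is hypothetical, and the sample values are three time stamps and three event times from the dummy data. Series.between is inclusive at both ends by default, which matches the <= and >= comparisons in the question.

```python
import pandas as pd


def flag_event_soon(table, event_times, seconds):
    """Return a boolean Series: True where a row is within `seconds` before an event."""
    soon = pd.Series(False, index=table.index)
    for event_time in event_times:
        # OR the window of each event into the running mask.
        soon = soon | table['Time'].between(event_time - seconds, event_time)
    return soon


table = pd.DataFrame({'Time': [10000, 10201, 11664]})
table['event_soon'] = flag_event_soon(table, [10000.0, 10100.0, 11600.0], 1800)
```

A boolean column avoids repeated string assignment through table.loc and is cheaper to store; the text labels can still be produced at the end if they are needed for display.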
