One of our DBAs benchmarked Cassandra against Oracle on AWS EC2 for INSERT performance (1M records) using the same Python code (below), and got the following surprising results:
Oracle 12.2, single node, 64 cores / 256 GB, EC2 EBS storage: 38 sec
Cassandra 5.1.13 (DDAC), single node, 2 cores / 4 GB, EC2 EBS storage: 464 sec
Cassandra 3.11.4, four nodes, 16 cores / 64 GB (each node), EC2 EBS storage: 486 sec
So - what are we doing wrong?
Why is Cassandra performing so slowly?
* Not enough nodes? (Why is the four-node cluster slower than the single node?)
* Configuration issues?
* Something else?
Thanks!
Following is the Python code:
import logging
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider
class PythonCassandraExample:
def __init__(self):
self.cluster = None
self.session = None
self.keyspace = None
self.log = None
def __del__(self):
self.cluster.shutdown()
def createsession(self):
auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
self.session = self.cluster.connect(self.keyspace)
def getsession(self):
return self.session
# How about Adding some log info to see what went wrong
def setlogger(self):
log = logging.getLogger()
log.setLevel('INFO')
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
log.addHandler(handler)
self.log = log
# Create Keyspace based on Given Name
def createkeyspace(self, keyspace):
"""
:param keyspace: The Name of Keyspace to be created
:return:
"""
# Before creating a new keyspace, check whether it already exists; if so, drop it and create it again
rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
if keyspace in [row[0] for row in rows]:
self.log.info("dropping existing keyspace...")
self.session.execute("DROP KEYSPACE " + keyspace)
self.log.info("creating keyspace...")
self.session.execute("""
CREATE KEYSPACE %s
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
""" % keyspace)
self.log.info("setting keyspace...")
self.session.set_keyspace(keyspace)
def create_table(self):
c_sql = """
CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
ename varchar,
sal double,
city varchar);
"""
self.session.execute(c_sql)
self.log.info("Employee Table Created !!!")
# lets do some batch insert
def insert_data(self):
i = 1
while i < 1000000:
insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
batch = BatchStatement()
batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
self.session.execute(batch)
# self.log.info('Batch Insert Completed for ' + str(i))
i += 1
# def select_data(self):
# rows = self.session.execute('select count(*) from perftest.employee limit 5;')
# for row in rows:
# print(row.ename, row.sal)
def update_data(self):
pass
def delete_data(self):
pass
if __name__ == '__main__':
example1 = PythonCassandraExample()
example1.createsession()
example1.setlogger()
example1.createkeyspace('perftest')
example1.create_table()
# Populate perftest.employee table
start = time.time()
example1.insert_data()
end = time.time()
print ('Duration: ' + str(end-start) + ' sec.')
# example1.select_data()
There are multiple issues here:
* For the 2nd test you didn't allocate enough memory and cores for DDAC, so Cassandra got only a 1 GB heap - by default Cassandra takes 1/4 of the available memory for its heap. The same goes for the 3rd test - each node gets only a 16 GB heap; you may need to bump it to a higher value, e.g. 24 GB or more.
* It's not clear how many IOPS you have in each test - EBS throughput depends on the size of the volume and its type.
* You're using the synchronous API to execute commands - essentially you insert the next item only after getting confirmation that the previous one was inserted. The best throughput is achieved with the asynchronous API (see the concurrent-write sketch below).
* You're preparing the statement on every iteration - this sends the CQL string to the server each time and slows everything down; just move the insert_sql = self.session.prepare(...) line out of the loop.
* (Not directly related) You're using batch statements to write the data - this is an anti-pattern in Cassandra, because all the data is sent to a single coordinator node, which then has to distribute it to the nodes that actually own it. This also explains why the 4-node cluster is slower than the single node.
P.S. Realistic load testing is quite hard. There are specialized tools for it; you can find more information in, for example, this blog post.
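For illustration, here is a minimal sketch of issuing the same inserts through the Python driver's concurrent helper instead of one synchronous single-row batch per iteration. The statement is prepared once, outside the loop; session stands for the connected session from the code above (e.g. example1.getsession()), and the concurrency value is just an assumption to tune for your cluster:
from cassandra.concurrent import execute_concurrent_with_args

# prepare once, outside any loop
insert_stmt = session.prepare(
    "INSERT INTO employee (emp_id, ename, sal, city) VALUES (?, ?, ?, ?)")
# parameter tuples for the 1M rows
params = ((i, 'Danny', 2555.0, 'De-vito') for i in range(1, 1000001))
# the driver keeps up to `concurrency` requests in flight at a time
execute_concurrent_with_args(session, insert_stmt, params, concurrency=100)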
The updated code below will batch every 100 records:
"""
Python by Techfossguru
Copyright (C) 2017 Satish Prasad
"""
import logging
import warnings
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider
class PythonCassandraExample:
def __init__(self):
self.cluster = None
self.session = None
self.keyspace = None
self.log = None
def __del__(self):
self.cluster.shutdown()
def createsession(self):
auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
self.session = self.cluster.connect(self.keyspace)
def getsession(self):
return self.session
# How about Adding some log info to see what went wrong
def setlogger(self):
log = logging.getLogger()
log.setLevel('INFO')
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
log.addHandler(handler)
self.log = log
# Create Keyspace based on Given Name
def createkeyspace(self, keyspace):
"""
:param keyspace: The Name of Keyspace to be created
:return:
"""
# Before creating a new keyspace, check whether it already exists; if so, drop it and create it again
rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
if keyspace in [row[0] for row in rows]:
self.log.info("dropping existing keyspace...")
self.session.execute("DROP KEYSPACE " + keyspace)
self.log.info("creating keyspace...")
self.session.execute("""
CREATE KEYSPACE %s
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
""" % keyspace)
self.log.info("setting keyspace...")
self.session.set_keyspace(keyspace)
def create_table(self):
c_sql = """
CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
ename varchar,
sal double,
city varchar);
"""
self.session.execute(c_sql)
self.log.info("Employee Table Created !!!")
# lets do some batch insert
def insert_data(self):
i = 1
insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
batch = BatchStatement()
warnings.filterwarnings("ignore", category=FutureWarning)
while i < 1000001:
# insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
# batch = BatchStatement()
batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
# Commit every 100 records
if (i % 100 == 0):
self.session.execute(batch)
batch = BatchStatement()
# self.log.info('Batch Insert Completed for ' + str(i))
i += 1
self.session.execute(batch)
# def select_data(self):
# rows = self.session.execute('select count(*) from actimize.employee limit 5;')
# for row in rows:
# print(row.ename, row.sal)
def update_data(self):
pass
def delete_data(self):
pass
if __name__ == '__main__':
example1 = PythonCassandraExample()
example1.createsession()
example1.setlogger()
example1.createkeyspace('actimize')
example1.create_table()
# Populate actimize.employee table
start = time.time()
example1.insert_data()
end = time.time()
print ('Duration: ' + str(end-start) + ' sec.')
# example1.select_data()
Related
I am new to programming and used Google OR-tools to create my VRP model. In my current model I have included a general time window and a capacity constraint per vehicle, creating a capacitated vehicle routing problem with time windows. I followed the OR-tools guides, which include a maximum travel duration for each vehicle.
However, I want to include a maximum travel duration for the sum of all routes, whereas the maximum travel duration per vehicle does not matter (so I set it to 100,000). Accordingly, I want to add something to the model/solution printer that tells me how many addresses could not be visited because of the constraint on the maximum travel duration for the sum of all routes. From the examples I have seen I think it should be fairly easy, but my programming knowledge is limited, so my attempts had no success. Can anyone help me?
import pandas as pd
import openpyxl
import numpy as np
import math
from random import sample
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
from scipy.spatial.distance import squareform, pdist
from haversine import haversine
#STEP - create data
# import/read excel file
data = pd.read_excel(r'C:\Users\Jean-Paul\Documents\Thesis\OR TOOLS\Data.xlsx', engine = 'openpyxl')
df = pd.DataFrame(data, columns= ['number','lat','lng']) # create dataframe with 10805 addresses + address of the depot
#print (df)
# randomly sample X addresses from the dataframe and their corresponding number/latitude/longitude
df_sample = df.sample(n=100)
#print (df_data)
# read first row of the excel file (= coordinates of the depot)
df_depot = pd.DataFrame(data, columns= ['number','lat','lng']).iloc[0:1]
#print (df_depot)
# combine dataframe of depot and sample into one dataframe
df_data = pd.concat([df_depot, df_sample], ignore_index=True, sort=False)
#print (df_data)
#STEP - create distance matrix data
# determine distances between the latitude/longitude coordinates
df_data.set_index('number', inplace=True)
matrix_distance = pd.DataFrame(squareform(pdist(df_data, metric=haversine)), index=df_data.index, columns=df_data.index)
matrix_list = np.array(matrix_distance)
#print (matrix_distance) # create table of distances between addresses including headers
#print (matrix_list) # converting table to list of lists and exclude headers
#STEP - create time matrix data
travel_time = matrix_list / 15 * 60 # divide distance by the travel speed (15 km/h here) and multiply by 60 to get minutes
#print (travel_time) # converting distance matrix to travel time matrix
#STEP - create time window data
# create list for each sample - couriers have to visit this address within 0-X minutes of time using a list of lists
window_range = []
for i in range(len(df_data)):
list = [0, 240]
window_range.append(list) # create list of list with a time window range for each address
#print (window_range)
#STEP - create demand data
# create list for each sample - all addresses demand 1 parcel except the depot
demand_range = []
for i in range(len(df_data.iloc[0:1])):
list = 0
demand_range.append(list)
for j in range(len(df_data.iloc[1:])):
list2 = 1
demand_range.append(list2)
#print (demand_range)
#STEP - create fleet size data # amount of vehicles in the fleet
fleet_size = 6
#print (fleet_size)
#STEP - create capacity data for each vehicle
fleet_capacity = []
for i in range(fleet_size): # capacity per vehicle
list = 20
fleet_capacity.append(list)
#print (fleet_capacity)
#STEP - create data model that stores all data for the problem
def create_data_model():
data = {}
data['time_matrix'] = travel_time
data['time_windows'] = window_range
data['num_vehicles'] = fleet_size
data['depot'] = 0 # index of the depot
data['demands'] = demand_range
data['vehicle_capacities'] = fleet_capacity
return data
#STEP - creating the solution printer
def print_solution(data, manager, routing, solution):
"""Prints solution on console."""
print(f'Objective: {solution.ObjectiveValue()}')
time_dimension = routing.GetDimensionOrDie('Time')
total_time = 0
for vehicle_id in range(data['num_vehicles']):
index = routing.Start(vehicle_id)
plan_output = 'Route for vehicle {}:\n'.format(vehicle_id)
while not routing.IsEnd(index):
time_var = time_dimension.CumulVar(index)
plan_output += '{0} Time({1},{2}) -> '.format(
manager.IndexToNode(index), solution.Min(time_var),
solution.Max(time_var))
index = solution.Value(routing.NextVar(index))
time_var = time_dimension.CumulVar(index)
plan_output += '{0} Time({1},{2})\n'.format(manager.IndexToNode(index),
solution.Min(time_var),
solution.Max(time_var))
plan_output += 'Time of the route: {}min\n'.format(
solution.Min(time_var))
print(plan_output)
total_time += solution.Min(time_var)
print('Total time of all routes: {}min'.format(total_time))
#STEP - create the VRP solver
def main():
# instantiate the data problem
data = create_data_model()
# create the routing index manager
manager = pywrapcp.RoutingIndexManager(len(data['time_matrix']),
data['num_vehicles'], data['depot'])
# create routing model
routing = pywrapcp.RoutingModel(manager)
#STEP - create demand callback and dimension for capacity
# create and register a transit callback
def demand_callback(from_index):
"""Returns the demand of the node."""
# convert from routing variable Index to demands NodeIndex
from_node = manager.IndexToNode(from_index)
return data['demands'][from_node]
demand_callback_index = routing.RegisterUnaryTransitCallback(
demand_callback)
routing.AddDimensionWithVehicleCapacity(
demand_callback_index,
0, # null capacity slack
data['vehicle_capacities'], # vehicle maximum capacities
True, # start cumul to zero
'Capacity')
#STEP - create time callback
# create and register a transit callback
def time_callback(from_index, to_index):
"""Returns the travel time between the two nodes."""
# convert from routing variable Index to time matrix NodeIndex
from_node = manager.IndexToNode(from_index)
to_node = manager.IndexToNode(to_index)
return data['time_matrix'][from_node][to_node]
transit_callback_index = routing.RegisterTransitCallback(time_callback)
# define cost of each Arc (costs in terms of travel time)
routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
# STEP - create a dimension for the travel time (TIMEWINDOW) - a dimension keeps track of quantities that accumulate over a vehicle's route
# add time windows constraint
time = 'Time'
routing.AddDimension(
transit_callback_index,
2, # allow waiting time (does not have an influence in this model)
100000, # maximum total route length in minutes per vehicle (has no influence here because of the capacity constraint)
False, # do not force start cumul to zero
time)
time_dimension = routing.GetDimensionOrDie(time)
# add time window constraints for each location except depot
for location_idx, time_window in enumerate(data['time_windows']):
if location_idx == data['depot']:
continue
index = manager.NodeToIndex(location_idx)
time_dimension.CumulVar(index).SetRange(time_window[0], time_window[1])
# add time window constraint for each vehicle start node
depot_idx = data['depot']
for vehicle_id in range(data['num_vehicles']):
index = routing.Start(vehicle_id)
time_dimension.CumulVar(index).SetRange(
data['time_windows'][depot_idx][0],
data['time_windows'][depot_idx][1])
#STEP - instantiate route start and end times to produce feasible times
for i in range(data['num_vehicles']):
routing.AddVariableMinimizedByFinalizer(
time_dimension.CumulVar(routing.Start(i)))
routing.AddVariableMinimizedByFinalizer(
time_dimension.CumulVar(routing.End(i)))
#STEP - setting default search parameters and a heuristic method for finding the first solution
search_parameters = pywrapcp.DefaultRoutingSearchParameters()
search_parameters.first_solution_strategy = (
routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
#STEP - solve the problem with the search parameters and print the solution
solution = routing.SolveWithParameters(search_parameters)
if solution:
print_solution(data, manager, routing, solution)
if __name__ == '__main__':
main()
See @Mizux's answer, which goes under the hood of the solver to build a summation cost over all vehicle route lengths:
https://stackoverflow.com/a/68756570/13773745
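In the meantime, here is a rough sketch (not that answer's exact approach) of the two pieces asked about, layered on top of the model above: let the solver drop addresses via disjunctions so some can remain unvisited, and cap the total travel time over all routes by constraining the cumulative time variables. The penalty and the 480-minute cap are illustrative assumptions:
# Sketch only - to be placed inside main() after the 'Time' dimension is created.
# 1) Allow the solver to drop addresses at a cost (otherwise a global cap can
#    make the model infeasible instead of leaving addresses unvisited):
penalty = 100000  # illustrative penalty per dropped address
for node in range(1, len(data['time_matrix'])):  # every node except the depot
    routing.AddDisjunction([manager.NodeToIndex(node)], penalty)

# 2) Cap the summed duration of all routes (480 minutes is an illustrative value):
solver = routing.solver()
route_durations = [time_dimension.CumulVar(routing.End(v)) -
                   time_dimension.CumulVar(routing.Start(v))
                   for v in range(data['num_vehicles'])]
solver.Add(solver.Sum(route_durations) <= 480)

# 3) After solving, count the addresses that were dropped:
dropped = [manager.IndexToNode(index) for index in range(routing.Size())
           if not (routing.IsStart(index) or routing.IsEnd(index))
           and solution.Value(routing.NextVar(index)) == index]
print('Addresses not visited:', len(dropped))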
import btalib
from btalib.indicators import sma
import pandas as pd
import backtrader as bt
import os.path #To manage paths
import sys # to find out the script name
import datetime
import matplotlib as plt
from backtrader import cerebro
from numpy import mod #for datetime object
df = pd.read_csv('C:/Users/User/Desktop/programming/dataset/coin_Bitcoin.csv',parse_dates=True, index_col='Date')
sma14 = btalib.sma(df, period = 14)
sma5 = btalib.sma(df, period=5)
class TestStrategy(bt.Strategy):
params = (
('exitbars', 5),
)
def log(self, txt, dt=None):
#Logging function fot this strategy
dt = dt or self.datas[0].datetime.date(0)
print('%s, %s' % (dt.isoformat(), txt))
def __init__(self):
# Keep a reference to the "close" line in the data[0] dataseries
self.dataclose = self.datas[0].close
# To keep track of pending orders
self.order = None
self.buyprice = None
self.buycomm = None
def notify_order(self, order):
if order.status in [order.Submitted, order.Accepted]:
# Buy/Sell order submitted/accepted to/by broker - Nothing to do
return
if order.status in [order.Completed]:
if order.isbuy():
self.log(
'BUY EXECUTED, Price: %.2f, Cost: %.2f, Comm: %.2f' %
(order.executed.price,
order.executed.value,
order.executed.comm))
self.buyprice = order.executed.price
self.buycomm = order.executed.comm
else: #sell
self.log('SELL EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f'%
(order.executed.price,
order.executed.value,
order.executed.comm))
self.bar_executed = len(self)
elif order.status in [order.Canceled, order.Margin, order.Rejected]:
self.log('Order Canceled/Margin/Reject')
# Write down: no pending order
self.order = None
# Check if an order has been completed
# Attention: broker could reject order if not enough cash
def notify_trade(self, trade):
if not trade.isclosed:
return
self.log('OPERATION PROFIT, GROSS %.2f, NET %.2f' %
(trade.pnl, trade.pnlcomm))
def next(self):
#sma = btalib.sma(df, period=30)
# Simply log the closing price of the series from the reference
self.log('Close, %.2f' % self.dataclose[0])
# Check if an order is pending ... if yes, we cannot send a 2nd one
if self.order:
return
# Check if we are in the market
#if not self.position:
# Not yet ... we MIGHT BUY if ...
if sma5[0] > sma14[0]:
# BUY, BUY, BUY!!! (with all possible default parameters)
self.log('BUY CREATE, %.2f' % self.dataclose[0])
# Keep track of the created order to avoid a 2nd order
self.order = self.buy()
else:
# Already in the market ... we might sell
if sma5[0] < sma14[0]:
# SELL, SELL, SELL!!! (with all possible default parameters)
self.log('SELL CREATE, %.2f' % self.dataclose[0])
self.order = self.sell()
if __name__ == '__main__':
# Create a cerebro entity
cerebro = bt.Cerebro()
# Add a strategy
cerebro.addstrategy(TestStrategy)
modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
datapath = os.path.join(modpath, 'C:/programming/AlgoTrading/backtest/BTC-USD-YF.csv')
data = bt.feeds.YahooFinanceCSVData(
dataname = datapath,
fromdate = datetime.datetime(2020,5,1),
todate = datetime.datetime(2021,6,1),
reverse = False)
#Add the Data Feed to Cerebro
cerebro.adddata(data)
cerebro.broker.setcash(100000.0)
# Add a FixedSize sizer according to the stake
#cerebro.addsizer(bt.sizers.FixedSize, stake=10)
cerebro.addsizer(bt.sizers.FixedSize)
# Set the commission
cerebro.broker.setcommission(commission=0.0)
# Print out the starting conditions
print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())
# Run over everything
cerebro.run()
#print(df(data))
# Print out the final result
print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
cerebro.plot()
I tried hard to place a buy order when sma5 > sma14 and a sell order when sma5 < sma14, but it doesn't work.
I use backtrader as the backtesting library and btalib to generate the indicator signal via btalib.sma(df, period); cerebro is backtrader's backtesting engine.
Sometimes it buys and sells every day, or buys today and sells tomorrow.
Probably you have to invert the order of your df; this was my problem when computing the RSI with btalib.
Example: df = df.iloc[::-1]
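A minimal sketch of what that check could look like before the SMAs are computed, assuming the Date index is what determines the ordering:
# make sure the data is in ascending date order before computing the SMAs;
# if the newest rows come first, flip the frame
if not df.index.is_monotonic_increasing:
    df = df.iloc[::-1]   # or: df = df.sort_index()
sma14 = btalib.sma(df, period=14)
sma5 = btalib.sma(df, period=5)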
Hi, I'm having an issue transferring data from one database to another. I build a list from a field in a table in an MS SQL db, use that list to query an Oracle db table (with the initial list in the WHERE clause to filter results), and then load the query results back into the MS SQL db.
The program runs for the first few iterations but then errors out with the following error:
Traceback (most recent call last):
File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 67, in
insertIntoMSDatabase(idString)
File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 48, in insertIntoMSDatabase
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
I can't seem to find any guidance online for troubleshooting this error message. I feel the solution may be simple, but I just can't get there...
# import libraries
import cx_Oracle
import pyodbc
import logging
import time
import re
import math
import numpy as np
logging.basicConfig(level=logging.DEBUG)
conn = pyodbc.connect('''Driver={SQL Server Native Client 11.0};
Server='servername';
Database='dbname';
Trusted_connection=yes;''')
b = conn.cursor()
dsn_tns = cx_Oracle.makedsn('Hostname', 'port', service_name='name')
conn1 = cx_Oracle.connect(user=r'uid', password='pwd', dsn=dsn_tns)
c = conn1.cursor()
beginTime = time.time()
bind = (b.execute('''select distinct field1
from [server].[db].[dbo].[table]'''))
print('MSQL table(s) queried, List Generated')
# formats ids for sql string
def surroundWithQuotes(id):
return "'" + re.sub(",|\s$", "", str(id)) + "'"
def insertIntoMSDatabase(idString):
osql = '''SELECT distinct field1, field2
FROM Database.Table
WHERE field2 is not null and field3 IN ({})'''.format(idString)
c.execute(osql)
claimsdata = c.fetchall()
print('Oracle table(s) queried, Data Pulled')
mycursor = conn.cursor()
sql = '''INSERT INTO [dbo].[tablename]
(
[fields1]
,[field2]
)
VALUES (?,?)'''
val = claimsdata
mycursor.executemany(sql, val)
conn.commit()
ids = []
formattedIdStrings = []
# adds all the ids found in bind to an iterable array
for row in bind:
ids.append(row[0])
# splits the ids[] array into multiple arrays < 1000 in length
batchedIds = np.array_split(ids, math.ceil(len(ids) / 1000))
# formats the value inside each batchedId to be a string
for batchedId in batchedIds:
formattedIdStrings.append(",".join(map(surroundWithQuotes, batchedId)))
# runs insert into MS database for each batch of IDs
for idString in formattedIdStrings:
insertIntoMSDatabase(idString)
print("MSQL table loaded, Data inserted into destination")
endTime = time.time()
print("Program Time Elapsed: ",endTime-beginTime)
conn.close()
conn1.close()
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
Before calling .executemany() you need to verify that val is not an empty list (as would be the case if .fetchall() is called on a SELECT statement that returns no rows), e.g.,
if val:
mycursor.executemany(sql, val)
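Folded into the insertIntoMSDatabase() function from the question, the guard could look like this (a sketch; it simply skips any batch for which the Oracle query returns no rows):
def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    if not claimsdata:
        # nothing matched this batch of ids, so there is nothing to insert
        return
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename] ([fields1], [field2]) VALUES (?,?)'''
    mycursor.executemany(sql, claimsdata)
    conn.commit()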
I need to request data periodically from a configurable number of devices at configurable intervals (per device). All devices are connected to a shared data bus, so only one device can send data at the same time.
The devices have very little memory, so each device can only keep the data for a certain period of time before it is overwritten by the next chunk. This means I need to make sure to request data from any given device while it is still available, or else it will be lost.
I am looking for an algorithm that, given a list of devices and their respective timing properties, finds a feasible schedule in order to achieve minimal data loss.
I guess each device could be formally described using the following properties:
* data_interval: time it takes for the next chunk of data to become available
* max_request_interval: maximum amount of time between requests that will not cause data loss
* processing_time: time it takes to send a request and fully receive the corresponding response containing the requested data
Basically, I need to make sure to request data from every device once its data is ready and not yet expired, while keeping in mind the deadlines for all other devices.
Is there some sort of algorithm for this kind of problem? I highly doubt I'm the first person to ever encounter a situation like this. Searching for existing solutions online didn't yield many useful results, mainly because scheduling algorithms are mostly used for operating systems and such, where scheduled processes can be paused and resumed at will. I can't do this in my case, however, since the process of requesting and receiving a chunk of data is atomic, i.e. it can only be performed in its entirety or not at all.
I solved this problem using non-preemptive deadline monotonic scheduling.
Here's some python code for anyone interested:
"""This module implements non-preemptive deadline monotonic scheduling (NPDMS) to compute a schedule of periodic,
non-preemptable requests to slave devices connected to a shared data bus"""
from math import gcd
from functools import reduce
from typing import List
class Slave:
def __init__(self, name: str, period: int, processing_time: int, offset=0, deadline=None):
self.name = name
self.period = int(period)
self.processing_time = int(processing_time)
self.offset = int(offset)
if self.offset >= self.period:
raise ValueError("Slave %s: offset must be < period" % name)
self.deadline = int(deadline) if deadline else self.period
if self.deadline > self.period:
raise ValueError("Slave %s: deadline must be <= period" % name)
class Request:
def __init__(self, slave: Slave, start_time: int):
self.slave = slave
self.start_time = start_time
self.end_time = start_time + slave.processing_time
self.duration = self.end_time - self.start_time
def overlaps_with(self, other: 'Request'):
min_duration = self.duration + other.duration
start = min(other.start_time, self.start_time)
end = max(other.end_time, self.end_time)
effective_duration = end - start
return effective_duration < min_duration
class Scenario:
def __init__(self, *slaves: Slave):
self.slaves = list(slaves)
self.slaves.sort(key=lambda slave: slave.deadline)
# LCM of all slave periods
self.cycle_period = reduce(lambda a, b: a * b // gcd(a, b), [slave.period for slave in slaves])
def compute_schedule(self, resolution=1) -> 'Schedule':
request_pool = []
for t in range(0, self.cycle_period, resolution):
for slave in self.slaves:
if (t - slave.offset) % slave.period == 0 and t >= slave.offset:
request_pool.append(Request(slave, t))
request_pool.reverse()
scheduled_requests = []
current_request = request_pool.pop()
t = current_request.start_time
while t < self.cycle_period:
ongoing_request = Request(current_request.slave, t)
while ongoing_request.start_time <= t < ongoing_request.end_time:
t += resolution
scheduled_requests.append(ongoing_request)
if len(request_pool):
current_request = request_pool.pop()
t = max(current_request.start_time, t)
else:
current_request = None
break
if current_request:
request_pool.append(current_request)
return Schedule(self, scheduled_requests, request_pool)
class Schedule:
def __init__(self, scenario: Scenario, requests: List[Request], unscheduled: List[Request] = None):
self.scenario = scenario
self.requests = requests
self.unscheduled_requests = unscheduled if unscheduled else []
self._utilization = 0
for slave in self.scenario.slaves:
self._utilization += float(slave.processing_time) / float(slave.period)
self._missed_deadlines_dict = {}
for slave in self.scenario.slaves:
periods = scenario.cycle_period // slave.period
missed_deadlines = []
for period in range(periods):
start = period * slave.period
end = start + slave.period
request = self._find_request(slave, start, end)
if request:
if request.start_time < (start + slave.offset) or request.end_time > start + slave.deadline:
missed_deadlines.append(request)
if missed_deadlines:
self._missed_deadlines_dict[slave] = missed_deadlines
self._overlapping_requests = []
for i in range(0, len(requests)):
if i == 0:
continue
previous_request = requests[i - 1]
current_request = requests[i]
if current_request.overlaps_with(previous_request):
self._overlapping_requests.append((current_request, previous_request))
self._incomplete_requests = []
for request in self.requests:
if request.duration < request.slave.processing_time:
self._incomplete_requests.append(request)
@property
def is_feasible(self) -> bool:
return self.utilization <= 1 \
and not self.has_missed_deadlines \
and not self.has_overlapping_requests \
and not self.has_unscheduled_requests \
and not self.has_incomplete_requests
@property
def utilization(self) -> float:
return self._utilization
@property
def has_missed_deadlines(self) -> bool:
return len(self._missed_deadlines_dict) > 0
@property
def has_overlapping_requests(self) -> bool:
return len(self._overlapping_requests) > 0
@property
def has_unscheduled_requests(self) -> bool:
return len(self.unscheduled_requests) > 0
@property
def has_incomplete_requests(self) -> bool:
return len(self._incomplete_requests) > 0
def _find_request(self, slave, start, end) -> [Request, None]:
for r in self.requests:
if r.slave == slave and r.start_time >= start and r.end_time < end:
return r
return None
def read_scenario(file) -> Scenario:
from csv import DictReader
return Scenario(*[Slave(**row) for row in DictReader(file)])
def write_schedule(schedule: Schedule, file):
from csv import DictWriter
writer = DictWriter(file, fieldnames=["name", "start", "end"])
writer.writeheader()
for request in schedule.requests:
writer.writerow({"name": request.slave.name, "start": request.start_time, "end": request.end_time})
if __name__ == '__main__':
import argparse
import sys
parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter,
description='Use non-preemptive deadline monotonic scheduling (NPDMS) to\n'
'compute a schedule of periodic, non-preemptable requests to\n'
'slave devices connected to a shared data bus.\n\n'
'Prints the computed schedule to stdout as CSV. Returns with\n'
'exit code 0 if the schedule is feasible, else 1.')
parser.add_argument("csv_file", metavar="SCENARIO", type=str,
help="A csv file describing the scenario, i.e. a list\n"
"of slave devices with the following properties:\n"
"* name: name/id of the slave device\n\n"
"* period: duration of the period of time during\n"
" which requests must be dispatched\n\n"
"* processing_time: amount of time it takes to\n"
" fully process a request (worst-case)\n\n"
"* offset: offset for initial phase-shifting\n"
" (default: 0)\n\n"
"* deadline: amount of time during which data is\n"
" available after the start of each period\n"
" (default: <period>)")
parser.add_argument("-r", "--resolution", type=int, default=1,
help="The resolution used to simulate the passage of time (default: 1)")
args = parser.parse_args()
with open(args.csv_file, 'r') as f:
schedule = read_scenario(f).compute_schedule(args.resolution)
write_schedule(schedule, sys.stdout)
exit(0 if schedule.is_feasible else 1)
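For reference, a small usage sketch of the classes above, building a scenario in code instead of reading a CSV file (the device names and timing values are made up):
# Illustrative only: two devices polled over a shared bus
sensor_a = Slave("sensor_a", period=20, processing_time=3)
sensor_b = Slave("sensor_b", period=50, processing_time=5, deadline=30)
schedule = Scenario(sensor_a, sensor_b).compute_schedule(resolution=1)
print("feasible:", schedule.is_feasible)
print("requests per cycle:", len(schedule.requests))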
The cx_Oracle API was very fast for me until I tried to work with CLOB values.
I do it as follows:
import time
import cx_Oracle
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.prepare("insert into table_clob (msg_id, message) values (:msg_id, :msg)")
cur.bindarraysize = num_records
msg_arr = cur.var(cx_Oracle.CLOB, arraysize=num_records)
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
for id in range(num_records):
msg_arr.setvalue(id, text)
rows.append( (id, msg_arr) ) # ???
print('{} records prepared, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
cur.executemany(None, rows)
con.commit()
print('{} records inserted, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
The main problem worrying me is performance:
100 records prepared, 25.090 s - far too long for copying 100 MB in memory!
100 records inserted, 23.503 s - seems too long for 100 MB over the network.
The problematic step is msg_arr.setvalue(id, text). If I comment it out, the script takes just milliseconds to complete (inserting NULL into the CLOB column, of course).
Secondly, it seems weird to append the same reference to the CLOB variable to the rows array. I found this example on the internet, and it works correctly, but am I doing it right?
Are there ways to improve performance in my case?
UPDATE: Tested network throughput: a 107 MB file copies in 11 s via SMB to the same host. But again, network transfer is not the main problem. Data preparation takes an abnormally long time.
It's a weird workaround (thanks to Avinash Nandakumar from the cx_Oracle mailing list), but it's a real way to greatly improve performance when inserting CLOBs:
import time
import cx_Oracle
import sys
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.bindarraysize = num_records
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
cur.executemany(
"insert into table_clob (msg_id, message) values (:msg_id, empty_clob())",
[(i,) for i in range(1, 101)])
print('{} records prepared, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
selstmt = "select message from table_clob " +
"where msg_id between 1 and :1 for update"
cur.execute(selstmt, [num_records])
for id in range(num_records):
results = cur.fetchone()
results[0].write(text)
con.commit()
print('{} records inserted, {:.3f} s'
.format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
Semantically this is not exactly the same as in my original post; I wanted to keep the example as simple as possible to show the principle. The point is that you should insert empty_clob(), then select it and write its contents.