I am trying to load a CSV file into an Oracle database and am facing this error:
cx_Oracle.DatabaseError: ORA-01036: illegal variable name/number
Can you please help me understand the root cause and a possible fix for this issue?
def main():
    ConStr = 'UserName/PWD#END_POINT'
    con = cx_Oracle.connect(ConStr)
    cur = con.cursor()
    with open('Persons.csv','r') as file:
        read_csv = csv.reader(file, delimiter='|')
        sql = "insert into Persons (PERSONID,LASTNAME,FIRSTNAME,ADDRESS,CITY) values (:1,:2,:3,:4,:5)"
        for lines in read_csv:
            print(lines)
            cur.executemany(sql, lines)
    cur.close()
    con.commit()
    con.close()
My CSV file looks like this:
PERSONID|LASTNAME|FIRSTNAME|ADDRESS|CITY
001|abc|def|ghi|jkl
002|opq|rst|uvw|xyz
The root cause is the call cur.executemany(sql, lines) inside the loop: executemany() expects a list of rows (one sequence of bind values per row), but lines is a single row of strings, so each string gets interpreted as a row whose length does not match the five bind placeholders, which is why Oracle reports ORA-01036. (The header row would also end up being bound as data.) The fix is to accumulate rows into a list and call executemany() once per batch. From the Oracle documentation:
import cx_Oracle
import csv

. . .

# Predefine the memory areas to match the table definition
cursor.setinputsizes(None, 25)

# Adjust the batch size to meet your memory and performance requirements
batch_size = 10000

with open('testsp.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    sql = "insert into test (id,name) values (:1, :2)"
    data = []
    for line in csv_reader:
        data.append((line[0], line[1]))
        if len(data) % batch_size == 0:
            cursor.executemany(sql, data)
            data = []
    if data:
        cursor.executemany(sql, data)
    con.commit()
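Adapting that documented batching pattern to the pipe-delimited Persons.csv above might look like the sketch below (the connection keywords are placeholders taken from the question; note that the header row is skipped so it is not bound as data):

import csv
import cx_Oracle

# Placeholder credentials/DSN from the question
con = cx_Oracle.connect(user='UserName', password='PWD', dsn='END_POINT')
cur = con.cursor()

batch_size = 10000  # adjust to your memory/performance needs
sql = "insert into Persons (PERSONID,LASTNAME,FIRSTNAME,ADDRESS,CITY) values (:1,:2,:3,:4,:5)"

with open('Persons.csv', 'r') as file:
    read_csv = csv.reader(file, delimiter='|')
    next(read_csv)  # skip the header row
    data = []
    for line in read_csv:
        data.append(tuple(line))  # one tuple of 5 bind values per row
        if len(data) % batch_size == 0:
            cur.executemany(sql, data)
            data = []
    if data:
        cur.executemany(sql, data)

con.commit()
cur.close()
con.close()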
Hi, I'm having an issue with the transfer of data from one database to another. I created a list using a field in a table on an MSSQL db, used that list to query an Oracle db table (using the initial list in the WHERE statement to filter results), and then loaded the query results back into the MSSQL db.
The program runs for the first few iterations but then errors out with the following error:
Traceback (most recent call last):
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 67, in <module>
    insertIntoMSDatabase(idString)
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 48, in insertIntoMSDatabase
    mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
I can't seem to find any guidance online to troubleshoot this error message. I feel it may be a simple solution, but I just can't get there...
# import libraries
import cx_Oracle
import pyodbc
import logging
import time
import re
import math
import numpy as np

logging.basicConfig(level=logging.DEBUG)

conn = pyodbc.connect('''Driver={SQL Server Native Client 11.0};
                         Server='servername';
                         Database='dbname';
                         Trusted_connection=yes;''')
b = conn.cursor()

dsn_tns = cx_Oracle.makedsn('Hostname', 'port', service_name='name')
conn1 = cx_Oracle.connect(user=r'uid', password='pwd', dsn=dsn_tns)
c = conn1.cursor()

beginTime = time.time()

bind = (b.execute('''select distinct field1
                     from [server].[db].[dbo].[table]'''))
print('MSQL table(s) queried, List Generated')

# formats ids for sql string
def surroundWithQuotes(id):
    return "'" + re.sub(",|\s$", "", str(id)) + "'"

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    print('Oracle table(s) queried, Data Pulled')
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename]
             (
              [fields1]
             ,[field2]
             )
             VALUES (?,?)'''
    val = claimsdata
    mycursor.executemany(sql, val)
    conn.commit()

ids = []
formattedIdStrings = []

# adds all the ids found in bind to an iterable array
for row in bind:
    ids.append(row[0])

# splits the ids[] array into multiple arrays < 1000 in length
batchedIds = np.array_split(ids, math.ceil(len(ids) / 1000))

# formats the value inside each batchedId to be a string
for batchedId in batchedIds:
    formattedIdStrings.append(",".join(map(surroundWithQuotes, batchedId)))

# runs insert into MS database for each batch of IDs
for idString in formattedIdStrings:
    insertIntoMSDatabase(idString)

print("MSQL table loaded, Data inserted into destination")

endTime = time.time()
print("Program Time Elapsed: ", endTime - beginTime)

conn.close()
conn1.close()
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
Before calling .executemany() you need to verify that val is not an empty list (as would be the case if .fetchall() is called on a SELECT statement that returns no rows), e.g.,
if val:
    mycursor.executemany(sql, val)
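Folded into the insertIntoMSDatabase() function above (reusing the module-level c cursor and conn connection from the question), the guard could be an early return:

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    if not claimsdata:
        # no matching rows for this batch of ids; skip the insert entirely
        return
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename] ([fields1], [field2]) VALUES (?,?)'''
    mycursor.executemany(sql, claimsdata)
    conn.commit()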
One of our DBAs has benchmarked Cassandra against Oracle on AWS EC2 for INSERT performance (1M records) using the same Python code (below), and got the following surprising results:
Oracle 12.2, Single node, 64cores/256GB, EC2 EBS storage, 38 sec
Cassandra 5.1.13 (DDAC), Single node, 2cores/4GB, EC2 EBS storage, 464 sec
Cassandra 3.11.4, Four nodes, 16cores/64GB(each node), EC2 EBS Storage, 486 sec
So, what are we doing wrong?
How come Cassandra is performing so slowly?
* Not enough nodes? (How come the 4 nodes is slower than single node?)
* Configuration issues?
* Something else?
Thanks!
Following is the Python code:
import logging
import time

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider

class PythonCassandraExample:

    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None

    def __del__(self):
        self.cluster.shutdown()

    def createsession(self):
        auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
        self.cluster = Cluster(['10.220.151.138'], auth_provider=auth_provider)
        self.session = self.cluster.connect(self.keyspace)

    def getsession(self):
        return self.session

    # How about adding some log info to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log

    # Create keyspace based on the given name
    def createkeyspace(self, keyspace):
        """
        :param keyspace: The name of the keyspace to be created
        :return:
        """
        # Before creating a new keyspace, check for an existing one; drop it and create anew
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)
        self.log.info("creating keyspace...")
        self.session.execute("""
            CREATE KEYSPACE %s
            WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
            """ % keyspace)
        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)

    def create_table(self):
        c_sql = """
            CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
                                                 ename varchar,
                                                 sal double,
                                                 city varchar);
            """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")

    # let's do some batch inserts
    def insert_data(self):
        i = 1
        while i < 1000000:
            insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename, sal, city) VALUES (?,?,?,?)")
            batch = BatchStatement()
            batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
            self.session.execute(batch)
            # self.log.info('Batch Insert Completed for ' + str(i))
            i += 1

    # def select_data(self):
    #     rows = self.session.execute('select count(*) from perftest.employee limit 5;')
    #     for row in rows:
    #         print(row.ename, row.sal)

    def update_data(self):
        pass

    def delete_data(self):
        pass

if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('perftest')
    example1.create_table()
    # Populate perftest.employee table
    start = time.time()
    example1.insert_data()
    end = time.time()
    print('Duration: ' + str(end - start) + ' sec.')
    # example1.select_data()
There are multiple issues here:
* For the 2nd test you didn't allocate enough memory and cores for DDAC, so Cassandra got only a 1Gb heap - by default Cassandra takes 1/4th of all available memory. The same holds for the 3rd test - it will get only 16Gb RAM for the heap; you may need to bump it to a higher value, like 24Gb or even more.
* It's not clear how many IOPS you have in each test - EBS throughput differs depending on the size and type of the volume.
* You're using the synchronous API to execute commands - basically you insert the next item only after you get confirmation that the previous one was inserted. The best throughput is achieved with the asynchronous API; see the sketch after this list.
* You're preparing your statement in every iteration - this sends the CQL string to the server each time, which slows everything down; just move the insert_sql = self.session.prepare( line out of the loop.
* (Not completely related) You're using batch statements to write data - that's an anti-pattern in Cassandra, as the data is sent to a single node that then has to distribute it to the nodes that really own it. This explains why the 4-node cluster is slower than the single node.
P.S. Realistic load testing is quite hard; there are specialized tools for it - you can find more information in this blog post, for example.
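Separately from the batching fix below, here is a minimal sketch of the asynchronous approach from the third point (reusing the cluster address, credentials, and employee table from the question; the in-flight window cap is an assumption, not a tuned value):

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(['10.220.151.138'], auth_provider=auth_provider)
session = cluster.connect('perftest')

# Prepare once, outside the loop
insert_sql = session.prepare("INSERT INTO employee (emp_id, ename, sal, city) VALUES (?,?,?,?)")

window = 500  # assumed cap on concurrent in-flight requests
futures = []
for i in range(1, 1000001):
    futures.append(session.execute_async(insert_sql, (i, 'Danny', 2555, 'De-vito')))
    if len(futures) >= window:
        # drain the current window before issuing more requests
        for f in futures:
            f.result()
        futures = []
for f in futures:
    f.result()

cluster.shutdown()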
The updated code below will batch every 100 records:
"""
Python by Techfossguru
Copyright (C) 2017 Satish Prasad
"""
import logging
import warnings
import time
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, BatchStatement
from cassandra.query import SimpleStatement
from cassandra.auth import PlainTextAuthProvider
class PythonCassandraExample:
def __init__(self):
self.cluster = None
self.session = None
self.keyspace = None
self.log = None
def __del__(self):
self.cluster.shutdown()
def createsession(self):
auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
self.cluster = Cluster(['10.220.151.138'],auth_provider = auth_provider)
self.session = self.cluster.connect(self.keyspace)
def getsession(self):
return self.session
# How about Adding some log info to see what went wrong
def setlogger(self):
log = logging.getLogger()
log.setLevel('INFO')
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
log.addHandler(handler)
self.log = log
# Create Keyspace based on Given Name
def createkeyspace(self, keyspace):
"""
:param keyspace: The Name of Keyspace to be created
:return:
"""
# Before we create new lets check if exiting keyspace; we will drop that and create new
rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
if keyspace in [row[0] for row in rows]:
self.log.info("dropping existing keyspace...")
self.session.execute("DROP KEYSPACE " + keyspace)
self.log.info("creating keyspace...")
self.session.execute("""
CREATE KEYSPACE %s
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
""" % keyspace)
self.log.info("setting keyspace...")
self.session.set_keyspace(keyspace)
def create_table(self):
c_sql = """
CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
ename varchar,
sal double,
city varchar);
"""
self.session.execute(c_sql)
self.log.info("Employee Table Created !!!")
# lets do some batch insert
def insert_data(self):
i = 1
insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
batch = BatchStatement()
warnings.filterwarnings("ignore", category=FutureWarning)
while i < 1000001:
# insert_sql = self.session.prepare("INSERT INTO employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
# batch = BatchStatement()
batch.add(insert_sql, (i, 'Danny', 2555, 'De-vito'))
# Commit every 100 records
if (i % 100 == 0):
self.session.execute(batch)
batch = BatchStatement()
# self.log.info('Batch Insert Completed for ' + str(i))
i += 1
self.session.execute(batch)
# def select_data(self):
# rows = self.session.execute('select count(*) from actimize.employee limit 5;')
# for row in rows:
# print(row.ename, row.sal)
def update_data(self):
pass
def delete_data(self):
pass
if __name__ == '__main__':
example1 = PythonCassandraExample()
example1.createsession()
example1.setlogger()
example1.createkeyspace('actimize')
example1.create_table()
# Populate actimize.employee table
start = time.time()
example1.insert_data()
end = time.time()
print ('Duration: ' + str(end-start) + ' sec.')
# example1.select_data()
I'm trying to use the Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
  .withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps: HiveOps = ...
val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))

hiveOps.createTable(
  testDBName,
  testTableName,
  schema,
  partitionKeys = Seq("pk", "pk1"),
  dialect = ParquetHiveDialect(),
  tableType = TableType.EXTERNAL_TABLE,
  overwrite = true
)

val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
Getting error: Number of partitions scanned(=32767) exceeds limit(=10000).
The number 32767 is 2^15 - 1 (Short.MaxValue), but I still can't figure out what is wrong. Any idea?
This looks like the related issue "Spark + Hive: Number of partitions scanned exceeds limit (=4000)", which can be worked around with:
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"
I am using a UDF to add an hour to a given date and using it as a filter condition in my Hive query, but it gives me the error below.
FAILED: ParseException line 8:4 cannot recognize input near 'TRANSFORM' '(' '"201606161340"' in expression specification
My Hive query is:

add file adddate.py;
select
    dt, param2
from
    my_table
where
    dt >= TRANSFORM("201606161340") USING adddate.py and parm1 = "465";
My Python code is:
from datetime import datetime, timedelta
import sys
import string

def add_date(current_date):
    date_object = datetime.strptime(current_date, '%Y%m%d%H%M')
    # print(date_object)
    one_hour = date_object + timedelta(hours=1)
    return one_hour.strftime('%Y%m%d%H%M')

# add_date("201606152350")
while True:
    line = sys.stdin.readline()
    if not line:
        break
    line = string.strip(line, "\n ")
    print(add_date(line))
The cx_Oracle API was very fast for me until I tried to work with CLOB values.
I do it as follows:
import time
import cx_Oracle

num_records = 100

con = cx_Oracle.connect('user/password#sid')
cur = con.cursor()
cur.prepare("insert into table_clob (msg_id, message) values (:msg_id, :msg)")
cur.bindarraysize = num_records

msg_arr = cur.var(cx_Oracle.CLOB, arraysize=num_records)
text = '$' * 2**20  # 1 MB of text

rows = []
start_time = time.perf_counter()
for id in range(num_records):
    msg_arr.setvalue(id, text)
    rows.append((id, msg_arr))  # ???
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))

start_time = time.perf_counter()
cur.executemany(None, rows)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))

cur.close()
con.close()
The main problem worrying me is performance:
100 records prepared, 25.090 s - very long for copying 100 MB in memory!
100 records inserted, 23.503 s - seems to be too much for 100 MB over the network.
The problematic step is msg_arr.setvalue(id, text). If I comment it out, the script takes just milliseconds to complete (inserting NULL into the CLOB column, of course).
Secondly, it seems weird to append the same reference to the CLOB variable to the rows array. I found this example on the internet, and it works correctly, but am I doing it right?
Are there ways to improve performance in my case?
UPDATE: Tested network throughput: a 107 MB file copies in 11 s via SMB to the same host. But again, network transfer is not the main problem. Data preparation takes an abnormally long time.
A weird workaround (thanks to Avinash Nandakumar from the cx_Oracle mailing list), but it's a real way to greatly improve performance when inserting CLOBs:
import time
import cx_Oracle
import sys

num_records = 100

con = cx_Oracle.connect('user/password#sid')
cur = con.cursor()
cur.bindarraysize = num_records

text = '$' * 2**20  # 1 MB of text
rows = []

start_time = time.perf_counter()
cur.executemany(
    "insert into table_clob (msg_id, message) values (:msg_id, empty_clob())",
    [(i,) for i in range(1, 101)])
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))

start_time = time.perf_counter()
selstmt = ("select message from table_clob "
           "where msg_id between 1 and :1 for update")
cur.execute(selstmt, [num_records])
for id in range(num_records):
    results = cur.fetchone()
    results[0].write(text)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))

cur.close()
con.close()
Semantically this is not exactly the same as in my original post; I wanted to keep the example as simple as possible to show the principle. The point is that you should insert empty_clob(), then select the LOB locator for update and write its contents.
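To sanity-check the written data, one can read a CLOB back before closing the connection; a small sketch using cx_Oracle's LOB interface (the msg_id and character counts are illustrative):

# Assumes con is still open against the same schema
cur = con.cursor()
cur.execute("select message from table_clob where msg_id = :1", [1])
lob, = cur.fetchone()
print(lob.size())       # expected: 1048576 characters (1 MB of '$')
print(lob.read(1, 10))  # the first 10 characters of the CLOB
cur.close()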