cx_Oracle equivalent of datetime64[ns] - oracle

Given the following kind of data in a column of a Pandas DataFrame (the Pandas data type is datetime64[ns]):
Contract_EffDt
1974-02-01
I want to import this (along with some string and int columns) into Oracle like this:
import cx_Oracle as cx
.
.
.
cur = con.cursor()
cur.setinputsizes(25,25,255,255,25,25,25,255,255,25,25,cx.DATE,int)
cur.bindarraysize = 1000
cur.executemany('''insert into Table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,\
EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13)''', rows)
con.commit()
cur.close()
con.close()
How can I get cx_Oracle to recognize my datetime column for what it is and insert it accordingly? As you can see, I've tried cx.DATE, but this doesn't work.
Thanks in advance!
Update with more information:
import cx_Oracle as cx
import pandas as pd
pwd=pwd
namec='table'
con = cx.Connection("table/"+pwd+"@server")
dfc = pd.read_csv(r'\\...\CPSC_Contract_Info_2016_03.csv',
                  encoding='latin-1')
#taken from https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Downloads/2016/March/CPSC-Enrollment-2016-03.zip
#^contract info file^
rows=dfc.values.tolist()
dfc['Contract_EffDt'] = pd.to_datetime(dfc['Contract_EffDt'])
dfc['Contract_EffDt'] = pd.to_datetime(dfc['Contract_EffDt'],format='%YYYY-%mm-%dd')
dfc['Plan_ID']=dfc['Plan_ID'].astype(str)
dfc.dtypes
Contract_ID object
Plan_ID object
Org_Type object
Plan_Type object
Offers_Part_D object
SNP_Plan object
EGHP object
Org_Name object
Org_Mkt_Name object
Plan_Name object
Parent_Org object
Contract_EffDt datetime64[ns]
YYYYMM int64
dtype: object
cur = con.cursor()
cur.setinputsizes(25,25,255,255,25,25,25,255,255,25,25,cx.DATETIME,int)
cur.bindarraysize = 1000
cur.executemany('''insert into table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,\
EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,to_date(:12,'yyyy-mm-dd'),:13)''', rows)
con.commit()
cur.close()
con.close()
Output:
TypeError Traceback (most recent call last)
<ipython-input-152-6696ee05e0f2> in <module>()
4 cur=con.cursor()
5 cur.bindarraysize = 1000
----> 6 cur.executemany('''insert into table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,to_date(:12,'yyyy-mm-dd'),:13)''', rows)
7 con.commit()
8 cur.close()
TypeError: expecting numeric data
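No answer is shown here, but the posted code points at two likely culprits: rows is built from dfc.values.tolist() before the to_datetime conversion runs, and the format string '%YYYY-%mm-%dd' uses invalid strftime codes. A minimal sketch of a possible fix (untested against the original data; it reuses the question's connection and column order, and relies on pandas boxing datetime64[ns] values as pd.Timestamp, a datetime.datetime subclass that cx_Oracle can bind):

# Convert the column first, with valid strftime codes
dfc['Contract_EffDt'] = pd.to_datetime(dfc['Contract_EffDt'], format='%Y-%m-%d')

# Build the bind rows only after the conversion, so each row carries
# pd.Timestamp objects instead of raw CSV strings
rows = dfc.values.tolist()

cur = con.cursor()
cur.setinputsizes(25,25,255,255,25,25,25,255,255,25,25,cx.DATETIME,int)
cur.bindarraysize = 1000
cur.executemany('''insert into Table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,
EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM)
values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13)''', rows)
con.commit()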

Related

Huggingface load_dataset() method how to assign the "features" argument?

I'm trying to load a custom dataset to use for fine-tuning a Huggingface model. My data is a CSV file with 2 columns: one is 'sequence', which is a string; the other is 'label', which is also a string, with 8 classes. I want to load my dataset and assign the type of the 'sequence' column to 'string' and the type of the 'label' column to 'ClassLabel'.
My code is this:
from datasets import Features
from datasets import load_dataset
ft = Features({'sequence':'str','label':'ClassLabel'})
mydataset = load_dataset("csv", data_files="mydata.csv",features= ft)
Running this code, I got the following error:
TypeError Traceback (most recent call last)
<ipython-input-59-45fedff522e8> in <module>()
7
8 mydataset = load_dataset("csv", data_files="mydata.csv",
----> 9 features= ft)
10
8 frames
/usr/local/lib/python3.7/dist-packages/datasets/features/features.py in get_nested_type(schema)
794
795 # Other objects are callable which returns their data type (ClassLabel, Array2D, Translation, Arrow datatype creation methods)
--> 796 return schema()
TypeError: 'str' object is not callable
Could someone help, please?
You should define your Features similarly to what is shown here:
from datasets import Features, Value, ClassLabel
from datasets import load_dataset

# ClassLabel needs the explicit list of class names
class_names = ['class_label_1', 'class_label_2']
ft = Features({'sequence': Value('string'), 'label': ClassLabel(names=class_names)})
mydataset = load_dataset("csv", data_files="mydata.csv", features=ft)
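As a quick sanity check (assuming the CSV loads into the default 'train' split), you can print the resulting schema to confirm both column types:

print(mydataset['train'].features)
# 'sequence' should show as a string Value and 'label' as a ClassLabel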

How to solve for pyodbc.ProgrammingError: The second parameter to executemany must not be empty

Hi, I'm having an issue with the transfer of data from one database to another. I created a list using a field in a table on an MSSQL db, used that list to query an Oracle db table (using the initial list in the WHERE statement to filter results), and I then load the query results back into the MSSQL db.
The program runs for the first few iterations but then errors out with the following error:
Traceback (most recent call last):
File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 67, in
insertIntoMSDatabase(idString)
File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 48, in insertIntoMSDatabase
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
I can't seem to find any guidance online for troubleshooting this error message. I feel it may be a simple solution, but I just can't get there...
# import libraries
import cx_Oracle
import pyodbc
import logging
import time
import re
import math
import numpy as np

logging.basicConfig(level=logging.DEBUG)

conn = pyodbc.connect('''Driver={SQL Server Native Client 11.0};
                         Server='servername';
                         Database='dbname';
                         Trusted_connection=yes;''')
b = conn.cursor()

dsn_tns = cx_Oracle.makedsn('Hostname', 'port', service_name='name')
conn1 = cx_Oracle.connect(user=r'uid', password='pwd', dsn=dsn_tns)
c = conn1.cursor()

beginTime = time.time()

bind = (b.execute('''select distinct field1
                     from [server].[db].[dbo].[table]'''))
print('MSQL table(s) queried, List Generated')

# formats ids for sql string
def surroundWithQuotes(id):
    return "'" + re.sub(",|\s$", "", str(id)) + "'"

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    print('Oracle table(s) queried, Data Pulled')
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename]
             (
             [fields1]
             ,[field2]
             )
             VALUES (?,?)'''
    val = claimsdata
    mycursor.executemany(sql, val)
    conn.commit()

ids = []
formattedIdStrings = []

# adds all the ids found in bind to an iterable array
for row in bind:
    ids.append(row[0])

# splits the ids[] array into multiple arrays < 1000 in length
batchedIds = np.array_split(ids, math.ceil(len(ids) / 1000))

# formats the value inside each batchedId to be a string
for batchedId in batchedIds:
    formattedIdStrings.append(",".join(map(surroundWithQuotes, batchedId)))

# runs insert into MS database for each batch of IDs
for idString in formattedIdStrings:
    insertIntoMSDatabase(idString)

print("MSQL table loaded, Data inserted into destination")
endTime = time.time()
print("Program Time Elapsed: ", endTime - beginTime)
conn.close()
conn1.close()
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
Before calling .executemany() you need to verify that val is not an empty list (as would be the case if .fetchall() is called on a SELECT statement that returns no rows), e.g.:
if val:
    mycursor.executemany(sql, val)
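Applied to the insertIntoMSDatabase() function from the question, that guard might look like this (a sketch using the question's placeholder table and field names):

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename] ([fields1], [field2]) VALUES (?, ?)'''
    # Some ID batches may match no Oracle rows; skip the insert in that case,
    # since executemany() raises ProgrammingError on an empty parameter list.
    if claimsdata:
        mycursor.executemany(sql, claimsdata)
        conn.commit()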

cx_Oracle.DatabaseError: ORA-01036: illegal variable name/number

I am trying to load a CSV file into the Oracle database and am facing this error:
cx_Oracle.DatabaseError: ORA-01036: illegal variable name/number
Can you please help me understand the root cause and a possible fix for this issue?
import csv
import cx_Oracle

def main():
    ConStr = 'UserName/PWD@END_POINT'
    con = cx_Oracle.connect(ConStr)
    cur = con.cursor()
    with open('Persons.csv', 'r') as file:
        read_csv = csv.reader(file, delimiter='|')
        sql = "insert into Persons (PERSONID,LASTNAME,FIRSTNAME,ADDRESS,CITY) values (:1,:2,:3,:4,:5)"
        for lines in read_csv:
            print(lines)
            cur.executemany(sql, lines)
    cur.close()
    con.commit()
    con.close()
My csv file looks like below :
PERSONID|LASTNAME|FIRSTNAME|ADDRESS|CITY
001|abc|def|ghi|jkl
002|opq|rst|uvw|xyz
From the Oracle documentation:
import cx_Oracle
import csv

. . .

# Predefine the memory areas to match the table definition
cursor.setinputsizes(None, 25)

# Adjust the batch size to meet your memory and performance requirements
batch_size = 10000

with open('testsp.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    sql = "insert into test (id,name) values (:1, :2)"
    data = []
    for line in csv_reader:
        data.append((line[0], line[1]))
        if len(data) % batch_size == 0:
            cursor.executemany(sql, data)
            data = []
    if data:
        cursor.executemany(sql, data)
    con.commit()
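For the question's code specifically, a likely trigger of ORA-01036 is that executemany() is called once per CSV line with a single flat row (lines is a list of five strings, and the header row gets bound as data too), instead of once with a list of rows. A possible corrected version following the pattern above (a sketch keeping the question's table and column names):

import csv
import cx_Oracle

def main():
    con = cx_Oracle.connect('UserName/PWD@END_POINT')
    cur = con.cursor()
    sql = "insert into Persons (PERSONID,LASTNAME,FIRSTNAME,ADDRESS,CITY) values (:1,:2,:3,:4,:5)"
    with open('Persons.csv', 'r') as file:
        read_csv = csv.reader(file, delimiter='|')
        next(read_csv)          # skip the header row
        rows = list(read_csv)   # each row is a list of five values
    cur.executemany(sql, rows)  # one call with a list of rows
    con.commit()
    cur.close()
    con.close()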

pandas series sort_index() not working with kind='mergesort'

I needed stable index sorting for DataFrames when I hit this problem:
In cases where a DataFrame becomes a Series (when only a single column matches the selection), the kind argument raises an error. See the example:
import pandas as pd
df_a = pd.Series(range(10))
df_b = pd.Series(range(100, 110))
df = pd.concat([df_a, df_b])
df.sort_index(kind='mergesort')
with the following error:
----> 6 df.sort_index(kind='mergesort')
TypeError: sort_index() got an unexpected keyword argument 'kind'
With DataFrames (more than one column selected), mergesort works OK.
EDIT:
When selecting a single column from a DataFrame, for example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame(np.array(range(25)).reshape(5,5))
df_b = pd.DataFrame(np.array(range(100, 125)).reshape(5,5))
df = pd.concat([df_a, df_b])
the following returns an error:
df[0].sort_index(kind='mergesort')
...since the selection is cast to a pandas Series, and, as pointed out, the pandas.Series.sort_index documentation contains a bug.
However,
df[[0]].sort_index(kind='mergesort')
works alright, since its type continues to be a DataFrame.
pandas.Series.sort_index() has no kind parameter.
Here is the definition of this function for Pandas 0.18.1 (file: ./pandas/core/series.py):
# line 1729
@Appender(generic._shared_docs['sort_index'] % _shared_doc_kwargs)
def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
               sort_remaining=True):
    axis = self._get_axis_number(axis)
    index = self.index
    if level is not None:
        new_index, indexer = index.sortlevel(level, ascending=ascending,
                                             sort_remaining=sort_remaining)
    elif isinstance(index, MultiIndex):
        from pandas.core.groupby import _lexsort_indexer
        indexer = _lexsort_indexer(index.labels, orders=ascending)
        indexer = com._ensure_platform_int(indexer)
        new_index = index.take(indexer)
    else:
        new_index, indexer = index.sort_values(return_indexer=True,
                                               ascending=ascending)
    new_values = self._values.take(indexer)
    result = self._constructor(new_values, index=new_index)
    if inplace:
        self._update_inplace(result)
    else:
        return result.__finalize__(self)
File ./pandas/core/generic.py, line 39:

_shared_doc_kwargs = dict(axes='keywords for axes', klass='NDFrame',
                          axes_single_arg='int or labels for object',
                          args_transpose='axes to permute (int or label for'
                                         ' object)')
So most probably it's a bug in the pandas documentation...
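One possible workaround on that pandas version, building on the df[[0]] observation above (an untested sketch): keep the selection as a one-column DataFrame, sort it stably, and then take the Series back out.

import pandas as pd
import numpy as np

df_a = pd.DataFrame(np.array(range(25)).reshape(5, 5))
df_b = pd.DataFrame(np.array(range(100, 125)).reshape(5, 5))
df = pd.concat([df_a, df_b])

# df[[0]] stays a DataFrame, so DataFrame.sort_index accepts kind='mergesort';
# selecting column 0 afterwards yields the stably sorted Series.
sorted_series = df[[0]].sort_index(kind='mergesort')[0]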
Your df is a Series, not a DataFrame.

Performance: how to insert CLOB fast using cx_Oracle and executemany()?

The cx_Oracle API was very fast for me until I tried to work with CLOB values.
I do it as follows:
import time
import cx_Oracle

num_records = 100

con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.prepare("insert into table_clob (msg_id, message) values (:msg_id, :msg)")
cur.bindarraysize = num_records
msg_arr = cur.var(cx_Oracle.CLOB, arraysize=num_records)

text = '$' * 2**20  # 1 MB of text
rows = []
start_time = time.perf_counter()
for id in range(num_records):
    msg_arr.setvalue(id, text)
    rows.append((id, msg_arr))  # ???
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))

start_time = time.perf_counter()
cur.executemany(None, rows)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
The main problem worrying me is performance:
100 records prepared, 25.090 s - far too long for copying 100 MB in memory!
100 records inserted, 23.503 s - seems too much for 100 MB over the network.
The problematic step is msg_arr.setvalue(id, text). If I comment it out, the script takes just milliseconds to complete (inserting NULL into the CLOB column, of course).
Secondly, it seems weird to append the same reference to the CLOB variable to the rows array. I found this example on the internet, and it works correctly, but am I doing it right?
Are there ways to improve performance in my case?
UPDATE: Tested network throughput: a 107 MB file copies in 11 s via SMB to the same host. But again, network transfer is not the main problem. Data preparation takes an abnormally long time.
A weird workaround (thanks to Avinash Nandakumar from the cx_Oracle mailing list), but it's a real way to greatly improve performance when inserting CLOBs:
import time
import cx_Oracle
import sys

num_records = 100

con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.bindarraysize = num_records

text = '$' * 2**20  # 1 MB of text
rows = []
start_time = time.perf_counter()
cur.executemany(
    "insert into table_clob (msg_id, message) values (:msg_id, empty_clob())",
    [(i,) for i in range(1, 101)])
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))

start_time = time.perf_counter()
selstmt = ("select message from table_clob "
           "where msg_id between 1 and :1 for update")
cur.execute(selstmt, [num_records])
for id in range(num_records):
    results = cur.fetchone()
    results[0].write(text)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
Semantically this is not exactly the same as in my original post; I wanted to keep the example as simple as possible to show the principle. The point is that you should insert empty_clob(), then select the LOB locator back and write its contents.
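For what it's worth, later cx_Oracle releases can also bind plain Python strings straight into a CLOB column if the bind type is predeclared with setinputsizes, which avoids the per-row temporary-LOB round trips entirely. A sketch of that approach (an assumption to verify against your cx_Oracle version; cx_Oracle.LONG_STRING is the type the library documents for large string binds):

import cx_Oracle

# 'con' is an open connection, as in the examples above
cur = con.cursor()
text = '$' * 2**20  # 1 MB of text

# Declare the second bind as a long string so cx_Oracle sends the data
# directly into the CLOB column instead of creating a LOB per row.
cur.setinputsizes(None, cx_Oracle.LONG_STRING)
cur.executemany(
    "insert into table_clob (msg_id, message) values (:1, :2)",
    [(i, text) for i in range(100)])
con.commit()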
