I have an SQL query that selects from two inner joined tables. The select statement takes about 50 seconds. However, the fetchall() takes 788 seconds for only 981 results:
time0 = time.time()
self.cursor.execute("SELECT spectrum_id, feature_table_id "+
"FROM spectrum AS s "+
"INNER JOIN feature AS f "+
"ON f.msrun_msrun_id = s.msrun_msrun_id "+
"INNER JOIN (SELECT feature_feature_table_id, min(rt) AS rtMin, max(rt) AS rtMax, min(mz) AS mzMin, max(mz) as mzMax "+
"FROM convexhull GROUP BY feature_feature_table_id) AS t "+
"ON t.feature_feature_table_id = f.feature_table_id "+
"WHERE s.msrun_msrun_id = ? "+
"AND s.scan_start_time >= t.rtMin "+
"AND s.scan_start_time <= t.rtMax "+
"AND base_peak_mz >= t.mzMin "+
"AND base_peak_mz <= t.mzMax", spectrumFeature_InputValues)
print 'query took:',time.time()-time0,'seconds'
time0 = time.time()
spectrumAndFeature_ids = self.cursor.fetchall()
print time.time()-time0,'seconds since to fetchall'
Is there a reason why fetchall() takes so long? Doing:
while 1:
    info = self.cursor.fetchone()
    if info:
        <do something>
    else:
        break
is just as slow as:
allInfo = self.cursor.fetchall()
for info in allInfo:
    <do something>
By default fetchall() is as slow as looping over fetchone() due to the arraysize of the Cursor object being set to 1.
To speed things up you can loop over fetchmany(), but to see a performance gain, you need to provide it with a size parameter bigger than 1, otherwise it'll fetch "many" by batches of arraysize, i.e. 1.
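For example, here is a minimal sketch of such a loop, reusing the cursor from the question; the batch size of 1000 is just an illustrative value, and <do something> stands for whatever processing you need:
# Fetch rows in explicit batches instead of one at a time.
batch_size = 1000
while True:
    rows = self.cursor.fetchmany(batch_size)
    if not rows:
        break
    for info in rows:
        pass  # <do something>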
It is quite possible that you can get the performance gain simply by raising the value of arraysize, but I have no experience doing this, so you may want to experiment with that first by doing something like:
>>> import sqlite3
>>> conn = sqlite3.connect(":memory:")
>>> cu = conn.cursor()
>>> cu.arraysize
1
>>> cu.arraysize = 10
>>> cu.arraysize
10
More on the above here: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
If you are running a complex query, try creating a table from that query (CREATE TABLE ... AS SELECT ...) and then running your selects against the created table, so the expensive work is done only once.
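As a rough sketch of that idea applied to the query above, the aggregated convexhull subquery could be materialized once and joined against in later queries; the table name precomputed_hulls and the connection attribute self.connection are made-up names for illustration:
# Materialize the expensive GROUP BY once; later queries join against
# precomputed_hulls instead of re-aggregating convexhull every time.
self.cursor.execute(
    "CREATE TABLE IF NOT EXISTS precomputed_hulls AS "
    "SELECT feature_feature_table_id, "
    "min(rt) AS rtMin, max(rt) AS rtMax, "
    "min(mz) AS mzMin, max(mz) AS mzMax "
    "FROM convexhull GROUP BY feature_feature_table_id")
self.connection.commit()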
Related
The query below hangs after exactly 5 calls, every time:
tx := db.Raw("select count(*) as hash from transaction_logs left join blocks on transaction_logs.block_number = blocks.number"+
    " where (transaction_logs.address = ? and transaction_logs.topic0 = ?) and blocks.total_confirmations >= 7 "+
    "group by transaction_hash", strings.ToLower("0xa11265a58d9f5a2991fe8476a9afea07031ac5bf"),
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef").Scan(&totalIds)
If we rewrite it without the arguments, it works:
db.Raw("select count(*) as hash from transaction_logs left join blocks on transaction_logs.block_number = blocks.number"+
" where (transaction_logs.address = #tokenAddress and transaction_logs.topic0 = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef') and blocks.total_confirmations >= 7 "+
"group by transaction_hash", sql.Named("tokenAddress", strings.ToLower("0xa11265a58d9f5a2991fe8476a9afea07031ac5bf"))
I even tried it with a named parameter, same result.
Can anyone help here?
Perhaps I misunderstand how to get a count of the rows returned by tiny_tds, which talks to MS SQL Server.
The following code reports -1 rows:
sql = "EXEC [Arrivals] #startDate='#{#startDate}', #endDate='#{#endDate}'"
client = TinyTds::Client.new(...)
result = client.execute(sql)
result.each
p result.affected_rows # always returns -1
This code, using a loop, counts rows correctly:
sql = "EXEC [Arrivals] #startDate='#{#startDate}', #endDate='#{#endDate}'"
client = TinyTds::Client.new(...)
result = client.execute(sql)
#no_of_arrivals = 0
result.each do |row|
#no_of_arrivals = #no_of_arrivals + 1
end
p #no_of_arrivals (returns correct count)
I did see affected_rows in action earlier today, on a table, and it worked. Could it have something to do with the SP... am I missing something obvious?
The cx_Oracle API was very fast for me until I tried to work with CLOB values.
I do it as follows:
import time
import cx_Oracle
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.prepare("insert into table_clob (msg_id, message) values (:msg_id, :msg)")
cur.bindarraysize = num_records
msg_arr = cur.var(cx_Oracle.CLOB, arraysize=num_records)
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
for id in range(num_records):
    msg_arr.setvalue(id, text)
    rows.append((id, msg_arr))  # ???
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
cur.executemany(None, rows)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
The main problem worrying me is performance:
100 records prepared, 25.090 s - far too long just to copy 100 MB in memory!
100 records inserted, 23.503 s - seems too much for 100 MB over the network.
The problematic step is msg_arr.setvalue(id, text). If I comment it out, the script takes just milliseconds to complete (inserting NULL into the CLOB column, of course).
Secondly, it seems weird to append the same reference to the CLOB variable to the rows array. I found this example on the internet, and it works correctly, but am I doing it right?
Are there ways to improve performance in my case?
UPDATE: I tested network throughput: a 107 MB file copies in 11 s via SMB to the same host. But again, network transfer is not the main problem; data preparation takes an abnormally long time.
A weird workaround (thanks to Avinash Nandakumar from the cx_Oracle mailing list), but it's a real way to greatly improve performance when inserting CLOBs:
import time
import cx_Oracle
import sys
num_records = 100
con = cx_Oracle.connect('user/password@sid')
cur = con.cursor()
cur.bindarraysize = num_records
text = '$'*2**20 # 1 MB of text
rows = []
start_time = time.perf_counter()
cur.executemany(
    "insert into table_clob (msg_id, message) values (:msg_id, empty_clob())",
    [(i,) for i in range(1, 101)])
print('{} records prepared, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
start_time = time.perf_counter()
selstmt = "select message from table_clob " +
"where msg_id between 1 and :1 for update"
cur.execute(selstmt, [num_records])
for id in range(num_records):
    results = cur.fetchone()
    results[0].write(text)
con.commit()
print('{} records inserted, {:.3f} s'
      .format(num_records, time.perf_counter() - start_time))
cur.close()
con.close()
Semantically this is not exactly the same as in my original post; I wanted to keep the example as simple as possible to show the principle. The point is that you should insert empty_clob(), then select it for update and write its contents.
I am getting the "missing right parenthesis" error. If I uncomment the iterator.next() statement, it works fine. I am unable to figure out what's wrong. There is NO "(" in the data I pass.
String ORACLE_SUM_QUERY = "select item_number, sum(system_quantity) from ITEMS " +
        "where sndate = ? and item_id in" +
        " (select item_id from ap.system_items where org_id = 4 " +
        " and segment1 in ";
......
while (iterator.hasNext()) {
    //iterator.next();
    String oracleQuery = String.format(ORACLE_SUM_QUERY + "(%s)) GROUP BY item_number", iterator.next());
    preparedStat = connection.prepareStatement(oracleQuery);
    preparedStat.setDate(1, getSnDate());
The error seems to indicate that the SQL statement you are building up in oracleQuery has an incorrect number of parentheses. It would probably be helpful to print that SQL statement before passing it to the prepareStatement call, to make debugging easier.
My guess is that the string that is returned by iterator.next() is not what you expect.
Try rewriting your code as
String ORACLE_SUM_QUERY = "select item_number, sum(system_quantity) from ITEMS " +
        "where sndate = ? and item_id in" +
        " (select item_id from ap.system_items where org_id = 4 " +
        " and segment1 in (";
......
while (iterator.hasNext())
{
    ORACLE_SUM_QUERY = ORACLE_SUM_QUERY + String.format("%s", iterator.next());
    if (iterator.hasNext())
        ORACLE_SUM_QUERY = ORACLE_SUM_QUERY + ",";
}
ORACLE_SUM_QUERY = ORACLE_SUM_QUERY + ")) GROUP BY item_number";
preparedStat = connection.prepareStatement(ORACLE_SUM_QUERY);
preparedStat.setDate(1, getSnDate());
This may not get it exactly as I can't test it, but it might get you closer.
Share and enjoy.
As an answer to my question Is it normal that sqlite.fetchall() is so slow?, it seems that fetchall() and fetchone() can be incredibly slow for sqlite.
As I mentioned there, I have the following query:
time0 = time.time()
self.cursor.execute("SELECT spectrum_id, feature_table_id "+
"FROM spectrum AS s "+
"INNER JOIN feature AS f "+
"ON f.msrun_msrun_id = s.msrun_msrun_id "+
"INNER JOIN (SELECT feature_feature_table_id, min(rt) AS rtMin, max(rt) AS rtMax, min(mz) AS mzMin, max(mz) as mzMax "+
"FROM convexhull GROUP BY feature_feature_table_id) AS t "+
"ON t.feature_feature_table_id = f.feature_table_id "+
"WHERE s.msrun_msrun_id = ? "+
"AND s.scan_start_time >= t.rtMin "+
"AND s.scan_start_time <= t.rtMax "+
"AND base_peak_mz >= t.mzMin "+
"AND base_peak_mz <= t.mzMax", spectrumFeature_InputValues)
print 'query took:',time.time()-time0,'seconds'
time0 = time.time()
spectrumAndFeature_ids = self.cursor.fetchall()
print time.time()-time0,'seconds since to fetchall'
The execution of the select statement takes about 50 seconds (acceptable). However, the fetchall() takes 788 seconds, only fetching 981 results.
The approach proposed in the answer to my question Is it normal that sqlite.fetchall() is so slow?, namely using fetchmany(), has not improved the speed of fetching the results.
How can I speed up fetching the results after running an sqlite query?
The SQL exactly as I tried to execute it on the command line:
sqlite> SELECT spectrum_id, feature_table_id
...> FROM spectrum AS s
...> INNER JOIN feature AS f
...> ON f.msrun_msrun_id = s.msrun_msrun_id
...> INNER JOIN (SELECT feature_feature_table_id, min(rt) AS rtMin, max(rt) AS rtMax, min(mz) AS mzMin, max(mz) as mzMax
...> FROM convexhull GROUP BY feature_feature_table_id) AS t
...> ON t.feature_feature_table_id = f.feature_table_id
...> WHERE s.msrun_msrun_id = 1
...> AND s.scan_start_time >= t.rtMin
...> AND s.scan_start_time <= t.rtMax
...> AND base_peak_mz >= t.mzMin
...> AND base_peak_mz <= t.mzMax;
Update:
I started running the query on the command line about 45 minutes ago and it is still busy, so it is also very slow on the command line.
From reading this question, it sounds like you could benefit from using the APSW sqlite module. You may somehow be a victim of your sqlite module causing your query to be executed in a less performant manner.
I was curious so I tried using apsw myself. It wasn't too complicated. Why don't you give it a try?
To install it I had to:
1. Extract the latest version.
2. Have the installation package fetch the latest sqlite amalgamation:
   python setup.py fetch --sqlite
3. Build and install:
   sudo python setup.py install
4. Use it in place of the other sqlite module:
   import apsw
   <...>
   conn = apsw.Connection('foo.db')
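For reference, here is a minimal sketch of running the question's query through apsw; the file name 'foo.db' and the msrun_msrun_id value of 1 are placeholders, not values from the question:
import apsw

conn = apsw.Connection('foo.db')
cursor = conn.cursor()
query = """
    SELECT spectrum_id, feature_table_id
    FROM spectrum AS s
    INNER JOIN feature AS f
        ON f.msrun_msrun_id = s.msrun_msrun_id
    INNER JOIN (SELECT feature_feature_table_id,
                       min(rt) AS rtMin, max(rt) AS rtMax,
                       min(mz) AS mzMin, max(mz) AS mzMax
                FROM convexhull
                GROUP BY feature_feature_table_id) AS t
        ON t.feature_feature_table_id = f.feature_table_id
    WHERE s.msrun_msrun_id = ?
      AND s.scan_start_time >= t.rtMin
      AND s.scan_start_time <= t.rtMax
      AND base_peak_mz >= t.mzMin
      AND base_peak_mz <= t.mzMax
    """
# apsw cursors are iterated over directly; rows stream in as the query runs,
# so there is no separate fetchall() step.
for spectrum_id, feature_table_id in cursor.execute(query, (1,)):
    pass  # <do something with each row>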