Is there any way to view the physical SQLs executed by Calcite JDBC? - jdbc

Recently I am studying Apache Calcite, by now I can use explain plan for via JDBC to view the logical plan, and I am wondering how can I view the physical sql in the plan execution? Since there may be bugs in the physical sql generation so I need to make sure the correctness.
val connection = DriverManager.getConnection("jdbc:calcite:")
val calciteConnection = connection.asInstanceOf[CalciteConnection]
val rootSchema = calciteConnection.getRootSchema()
val dsInsightUser = JdbcSchema.dataSource("jdbc:mysql://localhost:13306/insight?useSSL=false&serverTimezone=UTC", "com.mysql.jdbc.Driver", "insight_admin","xxxxxx")
val dsPerm = JdbcSchema.dataSource("jdbc:mysql://localhost:13307/permission?useSSL=false&serverTimezone=UTC", "com.mysql.jdbc.Driver", "perm_admin", "xxxxxx")
rootSchema.add("insight_user", JdbcSchema.create(rootSchema, "insight_user", dsInsightUser, null, null))
rootSchema.add("perm", JdbcSchema.create(rootSchema, "perm", dsPerm, null, null))
val stmt = connection.createStatement()
val rs = stmt.executeQuery("""explain plan for select "perm"."user_table".* from "perm"."user_table" join "insight_user"."user_tab" on "perm"."user_table"."id"="insight_user"."user_tab"."id" """)
val metaData = rs.getMetaData()
while(rs.next()) {
for(i <- 1 to metaData.getColumnCount) printf("%s ", rs.getObject(i))
println()
}
result is
EnumerableCalc(expr#0..3=[{inputs}], proj#0..2=[{exprs}])
EnumerableHashJoin(condition=[=($0, $3)], joinType=[inner])
JdbcToEnumerableConverter
JdbcTableScan(table=[[perm, user_table]])
JdbcToEnumerableConverter
JdbcProject(id=[$0])
JdbcTableScan(table=[[insight_user, user_tab]])

There is a Calcite Hook, Hook.QUERY_PLAN that is triggered with the JDBC query strings. From the source:
/** Called with a query that has been generated to send to a back-end system.
* The query might be a SQL string (for the JDBC adapter), a list of Mongo
* pipeline expressions (for the MongoDB adapter), et cetera. */
QUERY_PLAN;
You can register a listener to log any query strings, like this in Java:
Hook.QUERY_PLAN.add((Consumer<String>) s -> LOG.info("Query sent over JDBC:\n" + s));

It is possible to see the generated SQL query by setting calcite.debug=true system property. The exact place where this is happening is in JdbcToEnumerableConverter. As this is happening during the execution of the query you will have to remove the "explain plan for"
from stmt.executeQuery.
Note that by setting debug mode to true you will get a lot of other messages as well as other information regarding generated code.

Related

Very slow connection to Snowflake from Databricks

I am trying to connect to Snowflake using R in databricks, my connection works and I can make queries and retrieve data successfully, however my problem is that it can take more than 25 minutes to simply connect, but once connected all my queries are quick thereafter.
I am using the sparklyr function 'spark_read_source', which looks like this:
query<- spark_read_source(
sc = sc,
name = "query_tbl",
memory = FALSE,
overwrite = TRUE,
source = "snowflake",
options = append(sf_options, client_Q)
)
where 'sf_options' are a list of connection parameters which look similar to this;
sf_options <- list(
sfUrl = "https://<my_account>.snowflakecomputing.com",
sfUser = "<my_user>",
sfPassword = "<my_pass>",
sfDatabase = "<my_database>",
sfSchema = "<my_schema>",
sfWarehouse = "<my_warehouse>",
sfRole = "<my_role>"
)
and my query is a string appended to the 'options' arguement e.g.
client_Q <- 'SELECT * FROM <my_database>.<my_schema>.<my_table>'
I can't understand why it is taking so long, if I run the same query from RStudio using a local spark instance and 'dbGetQuery', it is instant.
Is spark_read_source the problem? Is it an issue between Snowflake and Databricks? Or something else? Any help would be great. Thanks.

H2 show value of DB_CLOSE DELAY (set by SET DB_CLOSE_DELAY)

The H2 Database has a list of commands starting with SET, in particular SET DB_CLOSE_DELAY. I would like to find out what the value of DB_CLOSE_DELAY is. I am using JDBC. Setting is easy
cx.createStatement.execute("SET DB_CLOSE_DELAY 0")
but none of the following returns the actual value of DB_CLOSE_DELAY:
cx.createStatement.executeQuery("DB_CLOSE_DELAY")
cx.createStatement.executeQuery("VALUES(#DB_CLOSE_DELAY)")
cx.createStatement.executeQuery("GET DB_CLOSE_DELAY")
cx.createStatement.executeQuery("SHOW DB_CLOSE_DELAY")
Help would be greatly appreciated.
You can access this and other settings in the INFORMATION_SCHEMA.SETTINGS table - for example:
String url = "jdbc:h2:mem:;DB_CLOSE_DELAY=3";
Connection conn = DriverManager.getConnection(url, "sa", "the password goes here");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM INFORMATION_SCHEMA.SETTINGS where name = 'DB_CLOSE_DELAY'");
while (rs.next()) {
System.out.println(rs.getString("name"));
System.out.println(rs.getString("value"));
}
In this test, I use an unnamed in-memory database, and I explicitly set the delay to 3 seconds when I create the DB.
The output from the print statements is:
DB_CLOSE_DELAY
3

[Snowflake-jdbc]It hangs when get info from resetset object of connection.getMetadata().getColumns(...)

I try to test the jdbc connection of snowflake with codes below
Connection conn = .......
.......
ResultSet rs = conn.getMetaData().getColumns(**null**, "PUBLIC", "TAB1", null); // 1. set parameters to get metadata of table TAB1
while (rs.next()) { // 2. It hangs here if the first parameter is null in above liune; otherwise(set the corrent db name), it works fine
System.out.println( "precision:" + rs.getInt(7)
+ ",col type name:" + rs.getString(6)
+ ",col type:" + rs.getInt(5)
+ ",col name:" + rs.getString(4)
+ ",CHAR_OCTET_LENGTH:" + rs.getInt(16)
+ ",buf LENGTH:" + rs.getString(8)
+ ",SCALE:" + rs.getInt(9));
}
.......
I debug the codes above in Intellij IDEA, and find that the debugger can't get the details of the object, it always shows "Evaluating..."
The JDBC driver I used is snowflake-jdbc-3.12.5.jar
Is it a bug?
When the catalog (database) argument is null, the JDBC code effectively runs the following SQL, which you can verify in your Snowflake account's Query History UIs/Views:
show columns in account;
This is an expensive metadata query to run due to no filters and the wide requested breadth (columns across the entire account).
Depending on how many databases exist in your organization's account, it may require several minutes or upto an hour of execution to return back results, which explains the seeming "hang". On a simple test with about 50k+ tables dispersed across 100+ of databases and schemas, this took at least 15 minutes to return back results.
I debug the codes above in Intellij IDEA, and find that the debugger can't get the details of the object, it always shows "Evaluating..."
This may be a weirdness with your IDE, but in a pinch you can use the Dump Threads (Ctrl + Escape, or Ctrl + Break) option in IDEA to provide a single captured thread dump view. This should help show that the JDBC client thread isn't hanging (as in, its not locked or starved), it is only waiting on the server to send back results.
There is no issue with the 3.12.5 jar.I just tested the same version in Eclipse, I can inspect all the objects . Could be an issue with your IDE.
ResultSet columns = metaData.getColumns(null, null, "TESTTABLE123",null);
while (columns.next()){
System.out.print("Column name and size: "+columns.getString("COLUMN_NAME"));
System.out.print("("+columns.getInt("COLUMN_SIZE")+")");
System.out.println(" ");
System.out.println("COLUMN_DEF : "+columns.getString("COLUMN_DEF"));
System.out.println("Ordinal position: "+columns.getInt("ORDINAL_POSITION"));
System.out.println("Catalog: "+columns.getString("TABLE_CAT"));
System.out.println("Data type (integer value): "+columns.getInt("DATA_TYPE"));
System.out.println("Data type name: "+columns.getString("TYPE_NAME"));
System.out.println(" ");
}

DB2 iSeries doesn't lock on select for update

I'm migrating a legacy application using DB2 iSeries on AS400 that has a specific behavior that I have to reproduce using .NET and DB2.Data.DB2.iSeries client for .NET.
What I'm describing works for me with DB2 non AS400 but in AS400 DB2 it worlks for the legacy application i'm replacing - but not with my application.
The behavior in the original application:
Begin Transaction
ExecuteReader () => Select col1 from table1 where col1 = 1 for update.
The row is now locked. anyone else who tries to run Select for update should fail.
Close the Reader opened in line 2.
The row is now unlocked. - anyone else who tried to run select for update should succeed.
Close transaction and live happily ever after.
In my .NET code I have two problems:
Step 2 - only checks if the row is already locked - but doesn't actually lock it. so another user can and does run select for update - WRONG BEHAVIOUR
Once that works - I need the lock to get unlocked when the reader is closed (step 4)
Here's my code:
var cb = new IBM.Data.DB2.iSeries.iDB2ConnectionStringBuilder();
cb.DataSource = "10.0.0.1";
cb.UserID = "User";
cb.Password = "Password";
using (var con = new IBM.Data.DB2.iSeries.iDB2Connection(cb.ToString()))
{
con.Open();
var t = con.BeginTransaction(System.Data.IsolationLevel.ReadUncommitted);
using (var c = con.CreateCommand())
{
c.Transaction = t;
c.CommandText = "select col1 from table1 where col1=1 FOR UPDATE";
using (var r = c.ExecuteReader())
{
while (r.Read()) {
MessageBox.Show(con.JobName + "The Row Should Be Locked");
}
}
MessageBox.Show(con.JobName + "The Row Should Be unlocked");
}
}
When you run this code twice - you'll see both processes reach the "This row should be locked" which is the problem I'm describing.
The desired result would be that the first process will reach the "This row should be locked" and that the second process will fail with resource busy error.
Then when the first process reaches the second message box - "the row should be unlocked" the second process( after running again ) will reach the "This row should be locked" message.
Any help would be greatly appreciated
The documentation says:
When the UPDATE clause is used, FETCH operations referencing the cursor acquire an exclusive row lock.
This implies a cursor is being used, and the lock occurs when the fetch statement is executed. I don't see a cursor, or a fetch in your code.
Now, whether .NET handles this as a cursor, I don't know, but the DB2 UDB documentation does not have this notation.
Isolation Level allows this behavior. Reading rows that are locked.
ReadUncommitted
A dirty read is possible, meaning that no shared locks are issued and no exclusive locks are honored.
After much investigations we created a work around in the form of a stored procedure that performs the lock for us.
The stored procedure looks like this:
CREATE PROCEDURE lib.Select_For_Update (IN SQL CHARACTER (5000) )
MODIFIES SQL DATA CONCURRENT ACCESS RESOLUTION WAIT FOR OUTCOME
DYNAMIC RESULT SETS 1 OLD SAVEPOINT LEVEL COMMIT ON RETURN
NO DISALLOW DEBUG MODE SET OPTION COMMIT = *CHG BEGIN
DECLARE X CURSOR WITH RETURN TO CLIENT FOR SS ;
PREPARE SS FROM SQL ;
OPEN X ;
END
Then we call it using:
var cb = new IBM.Data.DB2.iSeries.iDB2ConnectionStringBuilder();
cb.DataSource = "10.0.0.1";
cb.UserID = "User";
cb.Password = "Password";
using (var con = new IBM.Data.DB2.iSeries.iDB2Connection(cb.ToString()))
{
con.Open();
var t = con.BeginTransaction(System.Data.IsolationLevel.ReadUncommitted);
using (var c = con.CreateCommand())
{
c.Transaction = t;
c.CommandType = CommandType.StoredProcedure;
c.AddParameter("sql","select col1 from table1 where col1=1 FOR UPDATE");
c.CommandText = "lib.Select_For_Update"
using (var r = c.ExecuteReader())
{
while (r.Read()) {
MessageBox.Show(con.JobName + "The Row Should Be Locked");
}
}
MessageBox.Show(con.JobName + "The Row Should Be unlocked");
}
}
We don't like it - but it works.

Netezza Batch Insert is very slow even in Batch execute mode

I am referring to this documentation. http://www-01.ibm.com/support/docview.wss?uid=swg21981328. As per the article if we use executeBatch method then inserts will be faster (The Netezza JDBC driver may detect a batch insert, and under the covers convert this to an external table load and external table load will be faster). I had to execute millions of insert statements and i am getting only a speed of 500 records per minute per connection max. Is there any better way to load data faster to netezza via jdbc connection? I am using spark and jdbc connection to insert the records.Why external table via loading is not happening even when i am executing in batches. Given below is the spark code i am using,
Dataset<String> insertQueryDataSet.foreachPartition( partition -> {
Connection conn = NetezzaConnector.getSingletonConnection(url, userName, pwd);
conn.setAutoCommit(false);
int commitBatchCount = 0;
int insertBatchCount = 0;
Statement statement = conn.createStatement();
//PreparedStatement preparedStmt = null;
while(partition.hasNext()){
insertBatchCount++;
//preparedStmt = conn.prepareStatement(partition.next());
statement.addBatch(partition.next());
//statement.addBatch(partition.next());
commitBatchCount++;
if(insertBatchCount % 10000 == 0){
LOGGER.info("Before executeBatch.");
int[] execCount = statement.executeBatch();
LOGGER.info("After execCount." + execCount.length);
LOGGER.info("Before commit.");
conn.commit();
LOGGER.info("After commit.");
}
}
//execute remaining statements
statement.executeBatch();
int[] execCount = statement.executeBatch();
LOGGER.info("After execCount." + execCount.length);
conn.commit();
conn.close();
});
I tried this approach(batch insert) but found very slow,
So I put all data in CSV & do external table load for each csv.
InsertReq="Insert into "+ tablename + " select * from external '"+ filepath + "' using (maxerrors 0, delimiter ',' unase 2000 encoding 'internal' remotesource 'jdbc' escapechar '\' )";
Jdbctemplate.execute(InsertReq);
Since I was using java so JDBC as source & note that csv file path is in single quotes .
Hope this helps.
If you find better than this approach, don't forget to post. :)

Resources