Offset Management - Confluent JDBC Connector in query mode - jdbc

As per the confluent documentation, when we use the query mode, we have to do the offset management. As per my understanding, we need to keep track of the last updated timestamp and pass it in the where clause when we restart the program each time. Could anyone confirm if the understanding is correct? Appreciate your help in advance!

You can do both - you can still set the timestamp and incrementing mode in addition to the query. it will simply add a where statement based on timestamp.column.name and/or incrementing.column.name field. You can even use a subquery if your query needs a where statement
As an example you could set your query to: select * from (select apples from tree where color = green) as subquery
with the timestamp.column.name set to ripedate the sql kafka will execute is:
select * from (select apples from tree where color = green) as subquery where ripedate > offsetdate

Related

Nifi executeSql with 30 threads very slow

We are using HDF to fetch large data from oracle. We have a generateTableFetch to create partition of 8000 records which create query like below :
Select * from ( Select a.*, ROWNUM rnum FROM (SELECT * FROM OPUSER.DEPENDENCY_TYPES WHERE (1=1))a WHERE ROWNUM <= 368000) WHERE rnum > 361000
Now this query is taking almost 20-25min to return from oracle.
Is there anything wrong that we are doing wrong or any configuration changes we can do.
Nifi uses jdbc connection so is there any oracle side configuration for that.
Also if we somehow add parallelism hint to the query example /parallel(c,2)/. WIll this help?
I'm guessing you're using Oracle 11 (or less) and have selected Oracle as the database type. Since LIMIT/OFFSET wasn't introduced until Oracle 12, NiFi uses the nested SELECT with ROWNUM approach to ensure each "page" of data contains unique values. If you are using Oracle 12+, make sure to use the Oracle 12+ database adapter instead, as it can leverage the LIMIT/OFFSET capabilities resulting in a faster query. Also make sure you have the appropriate index(es) in place to help with query execution.
As of NiFi 1.7.0, you might also consider setting the Column for Value Partitioning property. If you have a column (perhaps your DEPENDENCY_TYPES column) that is fairly uniformly distributed, and is not "too sparse" in relation to your Partition Size property value, GenerateTableFetch can use the column's values rather than the ROWNUM approach, resulting in faster queries. See NIFI-5143 and the GenerateTableFetch documentation for more details.
If you need to add hints to the JDBC session, then as of NiFi 1.9.0 (see NIFI-5780 for more details) you can add pre- and post-query statements to ExecuteSQL.

Oracle not using an index in a simple query, despite the hint

I have a table with column status. It is a string nullable column. I also have an index on this field only. Why is the following query not using the index?
select /*+ index(m IDX_STATUS) */ * from messages m where m.status = :1
Try running the query without named (bind) parameters. Sometimes that makes a big difference.
select * from messages m where m.status = 'P'
It may turn out that you don't even need a hint to trigger index usage.
A possible explanation is that the column contains many equal values, for example 90% of rows has status = 'D' (low-cardinality column). Now we can understand Oracle, why it did not use the index :) It simply doesn't make sense on value 'D', but is reasonable for other values. I would prefer Oracle to consider my hint (I know better), but that seems undoable.
In general there is an extremely useful guide Oracle SQL Tuning Guide: The index is being ignored. Still it does not mention the situation, where omitting bind parameter makes the problem go away. That's why I insisted on asking the question on SO.

Write a nested select statement with a where clause in Hive

I have a requirement to do a nested select within a where clause in a Hive query. A sample code snippet would be as follows;
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible or am I doing something wrong here, because I am getting an error while running the above script ?!
To further elaborate on what I am trying to do, there is a cassandra keyspace that I publish statistics with a timestamp. Periodically (hourly for example) this stats will be summarized using hive, once summarized that data will be stored separately with the corresponding hour. So when the query runs for the second time (and consecutive runs) the query should only run on the new data (i.e. - timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate hive table, and then use that value to filter out the raw stats.
Can this be achieved this using hive ?!
Subqueries inside a WHERE clause are not supported in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM B);
can be rewritten to:
SELECT a.KEY, a.val
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
Looking at the business requirements underlying your question, it occurs that you might get more efficient results by partitioning your Hive table using hour. If the data can be written to use this factor as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will not tease that limitation.
It will work if you put in :
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION : As > , < , = need one exact figure in the right side, while here we are getting multiple values which can be taken only with 'IN' clause.

How can I see the SQL execution plan in Oracle?

I'm learning about database indexes right now, and I'm trying to understand the efficiency of using them.
I'd like to see whether a specific query uses an index.
I want to actually see the difference between executing the query using an index and without using the index (so I want to see the execution plan for my query).
I am using sql+.
How do I see the execution plan and where can I found in it the information telling me whether my index was used or not?
Try using this code to first explain and then see the plan:
Explain the plan:
explain plan
for
select * from table_name where ...;
See the plan:
select * from table(dbms_xplan.display);
Edit: Removed the brackets
The estimated SQL execution plan
The estimated execution plan is generated by the Optimizer without executing the SQL query. You can generate the estimated execution plan from any SQL client using EXPLAIN PLAN FOR or you can use Oracle SQL Developer for this task.
EXPLAIN PLAN FOR
When using Oracle, if you prepend the EXPLAIN PLAN FOR command to a given SQL query, the database will store the estimated execution plan in the associated PLAN_TABLE:
EXPLAIN PLAN FOR
SELECT p.id
FROM post p
WHERE EXISTS (
SELECT 1
FROM post_comment pc
WHERE
pc.post_id = p.id AND
pc.review = 'Bingo'
)
ORDER BY p.title
OFFSET 20 ROWS
FETCH NEXT 10 ROWS ONLY
To view the estimated execution plan, you need to use DBMS_XPLAN.DISPLAY, as illustrated in the following example:
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY (FORMAT=>'ALL +OUTLINE'))
The ALL +OUTLINE formatting option allows you to get more details about the estimated execution plan than using the default formatting option.
Oracle SQL Developer
If you have installed SQL Developer, you can easily get the estimated execution plan for any SQL query without having to prepend the EXPLAIN PLAN FOR command:
##The actual SQL execution plan
The actual SQL execution plan is generated by the Optimizer when running the SQL query. So, unlike the estimated Execution Plan, you need to execute the SQL query in order to get its actual execution plan.
The actual plan should not differ significantly from the estimated one, as long as the table statistics have been properly collected by the underlying relational database.
GATHER_PLAN_STATISTICS query hint
To instruct Oracle to store the actual execution plan for a given SQL query, you can use the GATHER_PLAN_STATISTICS query hint:
SELECT /*+ GATHER_PLAN_STATISTICS */
p.id
FROM post p
WHERE EXISTS (
SELECT 1
FROM post_comment pc
WHERE
pc.post_id = p.id AND
pc.review = 'Bingo'
)
ORDER BY p.title
OFFSET 20 ROWS
FETCH NEXT 10 ROWS ONLY
To visualize the actual execution plan, you can use DBMS_XPLAN.DISPLAY_CURSOR:
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(FORMAT=>'ALLSTATS LAST ALL +OUTLINE'))
Enable STATISTICS for all queries
If you want to get the execution plans for all queries generated within a given session, you can set the STATISTICS_LEVEL session configuration to ALL:
ALTER SESSION SET STATISTICS_LEVEL='ALL'
This will have the same effect as setting the GATHER_PLAN_STATISTICS query hint on every execution query. So, just like with the GATHER_PLAN_STATISTICS query hint, you can use DBMS_XPLAN.DISPLAY_CURSOR to view the actual execution plan.
You should reset the STATISTICS_LEVEL setting to the default mode once you are done collecting the execution plans you were interested in. This is very important, especially if you are using connection pooling, and database connections get reused.
ALTER SESSION SET STATISTICS_LEVEL='TYPICAL'
Take a look at Explain Plan. EXPLAIN works across many db types.
For sqlPlus specifically, see sqlplus's AUTO TRACE facility.
Try this:
http://www.dba-oracle.com/t_explain_plan.htm
The execution plan will mention the index whenever it is used. Just read through the execution plan.

How does oracle execute an sql statement?

such as:
select country
from table1
inner join table2 on table1.id=table2.id
where table1.name='a' and table2.name='b'
group by country
after the parse, which part will be executed first?
It looks like you want to know the execution plan chosen by Oracle. You can get that ouput from Oracle itself:
set serveroutput off
< your query with hint "/*+ gather_plan_statistics */" inserted after SELECT >
select * from table(dbms_xplan.display_cursor(null, null, 'last allstats'));
See here for an explanation how to read a query plan: http://download.oracle.com/docs/cd/E11882_01/server.112/e16638/ex_plan.htm#i16971
Be aware however that the choice of a query plan is not fixed. Oracle tries to find the currently best query plan, based on available statistics data.
There are plenty of places you can find the order in which SQL is executed:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
But note that this is the "theoretical" order - SQL engines are allowed to perform the operations in other orders, provided that the end result appears to have been produced by using the above order.
If you install the free tool SQL*Developer from Oracle, then you can click a button to get the explain plan.
A quick explanation is at http://www.seeingwithc.org/sqltuning.html

Resources