How to take data in portions from Oracle using MyBatis?

In my application I query Oracle and fetch the data this way:
<select id="getAll" resultType="com.mappers.MyOracleMapper">
SELECT * FROM "OracleTable"
</select>
I get all the data. The problem is that there is a lot of it, and processing everything at once takes too long: the response from the database arrives after 3-4 minutes, which is not convenient.
How can I receive the rows in portions without using the id field (it does not exist, I do not know why)? That is, take the first portion of rows, for example the first 50, process them, and then take the next portion. Ideally, the number of rows per portion would be set by a variable in the properties file.
I can't work out how to do this in MyBatis; it is new to me. Thanks in advance.
There is such a field, and it is unique.
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY
doesn't work, because the Oracle version is earlier than 12c.

If you want to read millions of rows that's going to take time. It's normal to expect a few minutes to read and receive all the data over the wire.
Now, you have two options:
Use a Cursor
In MyBatis you can read the result of the query using the buffering a cursor gives you. The cursor reads a few hundred rows at a time and your app reads them one by one. Your app doesn't notice that behind the scenes there is buffering. Pretty good. For example, you can do:
Cursor<Client> clients = this.sqlSession.selectCursor("getAll");
for (Client c : clients) {
    // process one client
}
Keep in mind that cursors remain open only until the end of the transaction. If you close the transaction (or exit the method marked as @Transactional), the cursor won't be usable anymore.
Use Manual Pagination
This solution can work well for the first pages of the result set, but it becomes increasingly inefficient and slooooooow the more you advance in the result set. Use it only as a last resort.
The only case where this strategy can be efficient is when you have the chance of implementing "keyset pagination", where each query resumes after the last key value already read. I assume it's not the case here.
You can modify your query to perform explicit pagination. For example, you can do:
<select id="getPage" resultType="com.mappers.MyOracleMapper">
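<!-- sort first, then number the rows; "<" must be written as &lt; in mapper XML -->
select * from (
  select x.*, rownum rnum
  from (
    select * from OracleTable order by id
  ) x
  where rownum &lt;= #{endingRow}
)
where rnum >= #{startingRow}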
</select>
You'll need to provide the extra parameters startingRow and endingRow.
NOTE: It's imperative that you include an ORDER BY clause. Otherwise the pagination logic is meaningless. Choose any ordering you want, preferably something that is backed by an existing index.
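Since the question later mentions that a unique field does exist, keyset pagination may be an option after all. The following is only a minimal sketch of the idea in Python, with SQLite standing in for Oracle; the clients table, its columns, and the helper names are hypothetical. On Oracle before 12c the LIMIT clause would have to be replaced by a rownum filter around an ordered subquery.

```python
import sqlite3


def fetch_in_chunks(conn, chunk_size):
    """Yield lists of rows; each query resumes after the last id already seen."""
    last_id = None
    while True:
        if last_id is None:
            rows = conn.execute(
                "SELECT id, name FROM clients ORDER BY id LIMIT ?",
                (chunk_size,),
            ).fetchall()
        else:
            # The unique, indexed key lets the database seek straight to the
            # next chunk instead of re-reading everything that came before it.
            rows = conn.execute(
                "SELECT id, name FROM clients WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, chunk_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


def build_demo_db(row_count):
    """Create an in-memory table with row_count hypothetical client rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO clients VALUES (?, ?)",
        [(i, f"client-{i}") for i in range(1, row_count + 1)],
    )
    return conn


demo_conn = build_demo_db(120)
chunks = list(fetch_in_chunks(demo_conn, 50))
chunk_sizes = [len(chunk) for chunk in chunks]
```

The chunk size is an ordinary parameter here, so it could be read from a properties file as the question asks.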

Related

Getting next sequence value in correct order

I have a function in an Oracle database that gets me the next value of the sequence. I also have the following PySpark code:
def get_next_seq_value():
    QUERY = "SELECT SCHEMA.GET_NEXT_SEQ_VALUE FROM DUAL"
    sqlContext.clearCache()
    next_seq_value_df = sqlContext.read.format("jdbc").options(url=URL, driver=DRIVER, QUERY=QUERY, user=USER, password=PASSWORD).load().unpersist()
    next_seq_value = next_seq_value_df.take(1)[0][0]
    return next_seq_value
And I call this function from here:
array = []
for each_item in df_list:
    next_seq_value = get_next_seq_value().encode('utf-8').strip()
    array.append(next_seq_value)
The problem is the following:
When I run this, the array looks like this:
['545671', '545672', '545673', '545694', '545695', '545696']
Why don't I see 545674 and 545675? It just skipped to '545694'. How do I make sure the function is called in order?
Default sequence cache size is 20:
If you omit both CACHE and NOCACHE, then the database caches 20 sequence numbers by default.
So it looks like another session called nextval on your sequence between your calls.
In addition, judging by your QUERY = "SELECT SCHEMA.GET_NEXT_SEQ_VALUE FROM DUAL", it looks like you wrapped your_sequence.nextval in the function GET_NEXT_SEQ_VALUE. That looks like overkill here: you pay for extra calls (SQL -> PL/SQL -> .nextval) and their overhead. You can use just select seq.nextval from dual, or :x := seq.nextval; in PL/SQL. And if you want to generate N values at once, you can use: select seq.nextval from dual connect by level<=20;
Totally agree with both of the previous answers. I'm not sure what type of database architecture you're using, but I'd also like to point out that with Oracle RAC each cluster node instance will have a separate cache for the sequence too.
Eg:
node 1: sequence cache 101-120
node 2: sequence cache 121-140
node 3: sequence cache 141-160
So depending on which node happens to process a request the nextval might not be in sequential order, either.
The point is that when using sequences you should only count on the values being unique, not necessarily without gaps (eliminating the cache can impact performance severely), or even necessarily in sequential order depending on your physical server architecture. If keeping things in sequential order no matter what is important, add a timestamp to your record in addition to the sequence counter.
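The effect of per-node caches can be illustrated with a toy model. This is only a sketch in plain Python, not Oracle's actual implementation; the class names and the round-robin request pattern are assumptions made for the illustration.

```python
class CachedSequence:
    """Toy shared sequence: hands out blocks of cache_size consecutive values."""

    def __init__(self, start, cache_size):
        self.next_block_start = start
        self.cache_size = cache_size

    def reserve_block(self):
        first = self.next_block_start
        self.next_block_start += self.cache_size
        return iter(range(first, first + self.cache_size))


class Node:
    """Toy cluster node holding its own cache of sequence values."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.cache = iter(())  # empty until the first nextval() call

    def nextval(self):
        try:
            return next(self.cache)
        except StopIteration:
            # Cache exhausted: reserve a fresh block from the shared sequence.
            self.cache = self.sequence.reserve_block()
            return next(self.cache)


seq = CachedSequence(start=101, cache_size=20)
nodes = [Node(seq) for _ in range(3)]

# Six requests, served round-robin by the three nodes.
values = [nodes[i % 3].nextval() for i in range(6)]
```

In this model every value is unique, but the values do not arrive in ascending order, because each node serves requests from its own block.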
Your problem is apparently not the wrong order of the sequence-generated IDs but the gaps.
Once you decide to use sequences, you must generally reckon with gaps.
If you use the default cache size of 20, you will lose on average 10 IDs at the end of each session.
You may reduce this with NOCACHE, but even then, if you call nextval and then roll back the transaction, that ID is lost, as the next transaction typically starts with a new nextval...

Camel + CassandraQL : Process a table without putting all in memory

Goal: read a big Cassandra table, process line by line in parallel
Constraints:
not all rows in memory
no Spark, we have to use Camel
one shot, no need to poll the table
I built a first working version with CassandraQL, but this component seems to be limited to one query with everything in memory; I did not find mechanisms like fetchSize/fetchMoreResult. I looked at the CassandraProducer class, PollingConsumerPollingStrategy, ResultSetConversionStrategy... and saw nothing.
Would it be possible to read the table in chunks of, for example, 1000 elements, where each chunk generates an exchange that is later split across different threads?
I think that a ProducerTemplate injecting the first exchanges into the route could be the answer. But I don't understand how I could control the rate at which exchanges are produced, to avoid having too many rows in memory (to do so we would need, for example, to check the size of the next blocking queue and, if it holds more than X unconsumed elements, wait before producing more messages).
Maybe there are other options for doing something like this?
Maybe I did not see the magic parameter in CassandraQL?
Maybe we can override some parts of CassandraQL?
Thank you
This is not going to be an answer to your question, but I hope it kicks off some discussion. As someone who is learning Cassandra and has spent a bit of time on it, your question made me think. This mainly targets the fetchSize/fetchMoreResult part of the question.
First of all, two of your constraints contradict each other:
Not all rows in memory: I don't want them all fetched at once.
One shot, no need to poll the partition: I don't want to go back to the database more than once.
Unless what you actually meant is:
Not all rows in memory: I don't want them all fetched at once.
You can go back to the partition many times, as long as each visit resumes exactly where the previous one left off, i.e. it does not start from the first row again.
The first page should take the same time as the second page, and the 19th page the same time as the 20th.
So I am going to assume that what you meant is the second scenario and go with it.
Queries for Cassandra are going to satisfy the following two conditions:
They have a restriction on clustering columns.
They are already ordered by clustering columns.
Now consider the following table:
department (partition key), firstName (clustering key), personId (clustering key), lastname, etc. as normal columns
First query
select department, firstName, lastname, etc
from person
where department = 'depart1'
order by firstName ASC
limit 25;
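Second query (let's say the last record in the page had personId=25 and firstName=kavi). Comparing the clustering columns as a tuple also picks up the rows whose firstName sorts after 'kavi', not only the remaining 'kavi' rows:
select department, firstName, lastname, etc
from person
where department = 'depart1' and (firstName, personId) > ('kavi', 25)
order by firstName ASC
limit 25;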
As you can see, we can easily construct a Cassandra query that brings each chunk with certain size in constant time.
Now back to integration framework
I remember a concept called watermark in Mule, where the endpoint stores a value and remembers it so that it can start from there next time. In this case, the personId and firstName of the last record of the last page are the watermark, so the endpoint can issue the second query. I am sure we should be able to do the same with Camel.
I hope I have convinced you that polling is not an issue when each chunk is retrieved in constant time.
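The watermark idea can be sketched without any framework. The following is a minimal illustration in plain Python, not Camel or Cassandra driver code: a sorted list stands in for one partition ordered by its clustering key, and all names and sample rows are hypothetical.

```python
from bisect import bisect_right

# One partition, already sorted by the clustering key (first_name, person_id).
PARTITION = [
    ("ann", 23),
    ("bob", 24),
    ("kavi", 25),
    ("kavi", 26),
    ("kavi", 27),
    ("lee", 28),
    ("max", 29),
]


def read_page(watermark, page_size):
    """Return up to page_size rows that sort strictly after the watermark."""
    # bisect_right seeks to the resume point, much like a clustering-key
    # restriction does, instead of scanning from the first row.
    start = 0 if watermark is None else bisect_right(PARTITION, watermark)
    return PARTITION[start:start + page_size]


def read_all(page_size):
    """Yield pages one at a time, remembering only the last key seen."""
    watermark = None
    while True:
        page = read_page(watermark, page_size)
        if not page:
            return
        yield page
        watermark = page[-1]


pages = list(read_all(3))
```

Only one page plus the watermark is held at any time, which is the property the "not all rows in memory" constraint asks for.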

Sqlite view vs plain select statement performance

I have a simple table (with about 8 columns and a LOT of rows) in a SQLite database. There is a single program that runs as a service and performs selects, updates and inserts on the table quite often (approximately every 5 minutes). The selects are used only to determine which rows are to be updated, and they are based on a column that holds boolean values (probably translated to integer internally by SQLite).
There is also a web application that performs selects (always with a GROUP BY clause) whenever a web user wishes to view part of the data.
There are two ways to ask for data through the web application: (a) predefined filters (i.e. the where clause has specific conditions on 3 specific columns) and (b) custom filters (i.e. the user chooses the values for the conditions, but the columns participating in the where clause are the same as in (a)). As mentioned, in both cases there is a GROUP BY operation.
I am wondering whether using a view or a custom function might increase the performance. Currently, a "custom" select may take more than 30 seconds to complete - and that's before any data has been sent back to the user.
EDIT:
Using EXPLAIN QUERY PLAN on a "predefined" select statement yields only one row:
0|0|TABLE mytable
Using EXPLAIN on the same query, yields the following:
0|OpenVirtual|1|4|keyinfo(2,-BINARY,BINARY)
1|OpenVirtual|2|3|keyinfo(1,BINARY)
2|MemInt|0|5|
3|MemInt|0|4|
4|Goto|0|27|
5|MemInt|1|5|
6|Return|0|0|
7|IfMemPos|4|9|
8|Return|0|0|
9|AggFinal|0|0|count(0)
10|AggFinal|2|1|sum(1)
11|MemLoad|0|0|
12|MemLoad|1|0|
13|MemLoad|2|0|
14|MakeRecord|3|0|
15|MemLoad|0|0|
16|MemLoad|1|0|
17|Sequence|1|0|
18|Pull|3|0|
19|MakeRecord|4|0|
20|IdxInsert|1|0|
21|Return|0|0|
22|MemNull|1|0|
23|MemNull|3|0|
24|MemNull|0|0|
25|MemNull|2|0|
26|Return|0|0|
27|Gosub|0|22|
28|Goto|0|82|
29|Integer|0|0|
30|OpenRead|0|2|
31|SetNumColumns|0|9|
32|Rewind|0|48|
33|Column|0|8|
34|String8|0|0|123456789
35|Le|356|39|collseq(BINARY)
36|Column|0|3|
37|Integer|180|0|
38|Gt|100|42|collseq(BINARY)
39|Column|0|7|
40|Integer|1|0|
41|Ne|356|47|collseq(BINARY)
42|Column|0|6|
43|Sequence|2|0|
44|Column|0|3|
45|MakeRecord|3|0|
46|IdxInsert|2|0|
47|Next|0|33|
48|Close|0|0|
49|Sort|2|69|
50|Column|2|0|
51|MemStore|7|0|
52|MemLoad|6|0|
53|Eq|512|58|collseq(BINARY)
54|MemMove|6|7|
55|Gosub|0|7|
56|IfMemPos|5|69|
57|Gosub|0|22|
58|AggStep|0|0|count(0)
59|Column|2|2|
60|Integer|30|0|
61|Add|0|0|
62|ToReal|0|0|
63|AggStep|2|1|sum(1)
64|Column|2|0|
65|MemStore|1|1|
66|MemInt|1|4|
67|Next|2|50|
68|Gosub|0|7|
69|OpenPseudo|3|0|
70|SetNumColumns|3|3|
71|Sort|1|80|
72|Integer|1|0|
73|Column|1|3|
74|Insert|3|0|
75|Column|3|0|
76|Column|3|1|
77|Column|3|2|
78|Callback|3|0|
79|Next|1|72|
80|Close|3|0|
81|Halt|0|0|
82|Transaction|0|0|
83|VerifyCookie|0|1|
84|Goto|0|29|
85|Noop|0|0|
The select I used was the following:
SELECT
COUNT(*) as number,
field1,
SUM(CAST(filter2 +30 AS float)) as column2
FROM
mytable
WHERE
(filter1 > '123456789' AND filter2 > 180)
OR filter3=1
GROUP BY
field1
ORDER BY
number DESC, field1;
Whenever you're going to be doing comparisons on a non-primary-key field, it's a good design idea to add an index on the field(s). Too many indexes, however, can cause INSERTs to crawl, so plan accordingly.
Also, if you have simple fields such as ones that only hold a boolean value, you may want to consider declaring it as an INTEGER instead of whatever you declared it as. Declaring it as any type not specifically defined by SQLite will cause it to default to a NUMERIC type which will take longer to compare values because it will store it internally as a double and will use the floating-point math processor instead of the integer math processor.
IMO, the GROUP BY sorting directive is sometimes a dead giveaway to an unoptimized query; its methodology involves eliminating redundant data which could have been eliminated beforehand if it hadn't been pulled out of the database to begin with.
EDIT:
I saw your query and saw there are some simple things you can do to optimize it:
SUM(CAST(filter2 +30 AS float)) is inefficient; why are you casting it as a float? Why not just SUM it then add 30 * the COUNT?
filter1 > '123456789' - Why the string comparison? Why not just use integer comparison?
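The first suggestion can be checked on a small data set. This is only a sketch using Python's sqlite3 module with made-up rows; the column names follow the query in the question. It uses COUNT(filter2) rather than COUNT(*) so that the rewrite also matches the original if filter2 contains NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (field1 TEXT, filter2 INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("a", 200), ("a", 190), ("b", 181), ("b", 500), ("b", 250)],
)

# Original form: an addition and a cast are evaluated for every row.
per_row = conn.execute(
    "SELECT field1, SUM(CAST(filter2 + 30 AS float)) "
    "FROM mytable GROUP BY field1 ORDER BY field1"
).fetchall()

# Rewritten form: sum the raw column once, then add 30 per counted row.
rewritten = conn.execute(
    "SELECT field1, SUM(filter2) + 30 * COUNT(filter2) "
    "FROM mytable GROUP BY field1 ORDER BY field1"
).fetchall()
```

Both queries return the same totals per group; the second simply avoids the per-row arithmetic and cast.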

performance of rand()

I have heard that I should avoid using 'order by rand()', but I really need to use it. Unlike what I have been hearing, the following query comes up very fast.
select
cp1.img_id as left_id,
cp1.img_filename as left_filename,
cp1.facebook_name as left_facebook_name,
cp2.img_id as right_id,
cp2.img_filename as right_filename,
cp2.facebook_name as right_facebook_name
from
challenge_photos as cp1
cross join
challenge_photos as cp2
where
(cp1.img_id < cp2.img_id)
and
(cp1.img_id,cp2.img_id) not in ((0,0))
and
(cp1.img_status = 1 and cp2.img_status = 1)
order by rand() limit 1
Is this query considered 'okay'? Or should I use the queries that I can find by searching for "alternative to rand()"?
It's usually a performance thing. You should avoid, as much as possible, per-row functions since they slow down your queries.
That means things like uppercase(name), salary * 1.1 and so on. It also includes rand(). It may not be an immediate problem (at 10,000 rows) but, if you ever want your database to scale, you should keep it in mind.
The two main issues are the fact that you're performing a per-row function and then having to do a full sort on the output before selecting the first row. The DBMS cannot use an index if you sort on a random value.
But, if you need to do it (and I'm not making judgement calls there), then you need to do it. Pragmatism often overcomes dogmatism in the real world :-)
A possibility, if performance ever becomes an issue, is to get a count of the records with something like:
select count(*) from ...
then choose a random value on the client side and use a:
limit <start>, <count>
clause in another select, adjusting for the syntax used by your particular DBMS. This should remove the sorting issue and the transmission of unneeded data across the wire.
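The count-then-offset approach might look like the sketch below. It uses Python's sqlite3 module rather than the asker's DBMS, selects a single row rather than a pair, and the photos table and the function name are hypothetical.

```python
import random
import sqlite3


def pick_random_row(conn, rng=random):
    """Pick one active photo by random offset instead of ORDER BY rand()."""
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM photos WHERE status = 1"
    ).fetchone()
    if total == 0:
        return None
    # The random position is chosen on the client side.
    offset = rng.randrange(total)
    # Ordering by the primary key keeps the offset deterministic and lets the
    # database walk an index rather than sort on a random value.
    return conn.execute(
        "SELECT id, filename FROM photos WHERE status = 1 "
        "ORDER BY id LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, filename TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO photos VALUES (?, ?, ?)",
    [(i, f"img-{i}.jpg", i % 2) for i in range(1, 11)],
)
row = pick_random_row(conn, random.Random(0))
```

Note that a large OFFSET still makes the database step over the skipped rows, so this removes the sort but not all of the work.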

Oracle (PL/SQL): Is UPDATE RETURNING concurrent?

I'm using a table with a counter to ensure unique IDs on a child element.
I know it is usually better to use a sequence, but I can't, because I have a lot of counters: a customer can create a couple of buckets, each of which needs its own counter, and each counter has to start at 1 (it's a requirement; my customer needs "human readable" keys).
I'm creating records (let's call them items) that have a primary key (bucket_id, num = counter).
I need to guarantee that the bucket_id / num combination is unique (so using a sequence as the primary key won't fix my problem).
The rows aren't created in PL/SQL, so I need to claim the number up front (by the way, gaps are not against the requirements).
My solution was:
UPDATE bucket
SET counter = counter + 1
WHERE id = param_id
RETURNING counter INTO num_forprikey;
The PL/SQL returns num_forprikey so the item record can be created.
Question:
Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?
"Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?"
Yes, at least up to a point. The first user to issue that update gets a lock on the row. So no other user can successfully issue that same statement until user numero uno commits (or rolls back). So uniqueness is guaranteed.
Obviously, the cavil is regarding concurrency. Your access to the row is serialized, so there is no way for two users to get a new PRIKEY simultaneously. This is not necessarily a problem. It depends on how many users you have creating new Items, and how often they do it. One user peeling off numbers in the same session won't notice a thing.
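The serialization described above can be modelled in a few lines. This is only a toy simulation in Python, not Oracle code: a threading.Lock per bucket stands in for the row lock taken by the UPDATE, and the class and function names are hypothetical.

```python
import threading


class BucketCounters:
    """Toy stand-in for the bucket table; one lock per row models the row lock."""

    def __init__(self, bucket_ids):
        self._counters = {bucket_id: 0 for bucket_id in bucket_ids}
        self._locks = {bucket_id: threading.Lock() for bucket_id in bucket_ids}

    def claim_next(self, bucket_id):
        # Only one caller at a time can increment and read a given counter,
        # just as only one session at a time can update a given row.
        with self._locks[bucket_id]:
            self._counters[bucket_id] += 1
            return self._counters[bucket_id]


def claim_concurrently(counters, bucket_id, workers, claims_per_worker):
    """Have several threads claim numbers from the same bucket at once."""
    claimed = []
    claimed_lock = threading.Lock()

    def worker():
        for _ in range(claims_per_worker):
            num = counters.claim_next(bucket_id)
            with claimed_lock:
                claimed.append(num)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return claimed


counters = BucketCounters(["bucket-a", "bucket-b"])
numbers = claim_concurrently(counters, "bucket-a", workers=8, claims_per_worker=250)
```

Every claimed number is distinct, and callers on different buckets never wait for each other; callers on the same bucket queue up behind the lock.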
I seem to recall this problem from many years back, working on, of all things, an INGRES database. There were no sequences in those days, so a lot of effort was put into finding the best scaling solution for this problem by the top INGRES minds of the day. I was fortunate enough to be working alongside them, so that even though my mind is pitifully smaller than any of theirs, proximity = residual effect and I retained something. This was one of the things. Let me see if I can remember.
1) for each counter you need a row in a work table.
2) each time you need a number
a) lock the row
b) update it
c) get its new value (you use returning for this which I avoid like the plague)
d) commit the update to release your lock on the row
The reason for the commit is for trying to get some kind of scalability. There will always be a limit but you do not serialize on getting a number for any period of time.
In the Oracle world we would improve the situation by using a function defined as an AUTONOMOUS_TRANSACTION in order to acquire the next number. If you think about it, this solution requires that gaps be allowed, which you said is OK. By committing the number update independently of the main transaction, you gain scalability but you introduce gapping.
You will have to accept the fact that your scalability will drop dramatically in this scenario. This is due to at least two reasons:
1) the update/select/commit sequence does its best to reduce the time during which the KEY row is locked, but it is still not zero. Under heavy load, you will serialize and eventually be limited.
2) you are committing on every key get. A commit is an expensive operation requiring many memory and file management actions on the part of the database. This will limit you also.
In the end you are likely looking at three or more orders of magnitude drop in concurrent transaction load because you are not using sequences. I base this on my experience of the past.
But if your customer requires it, what can you do, right?
Good luck. I have not tested the code for syntax errors, I leave that to you.
create or replace function get_next_key (key_name_p in varchar2) return number is
  pragma autonomous_transaction;
  key_v number;
begin
  -- the update locks the counter row until the commit below
  update key_table set key = key + 1 where key_name = key_name_p;
  select key into key_v from key_table where key_name = key_name_p;
  commit;
  return (key_v);
end;
/
show errors
You can still use sequences, just use the row_number() analytic function to please your users. I described it here in more detail: http://rwijk.blogspot.com/2008/01/sequence-within-parent.html
Regards,
Rob.
I'd figure out how to make sequences work. It's the only guarantee, though an exception clause could be coded:
http://www.orafaq.com/forum/t/83382/0/ The benefit of sequences (and they could be dynamically created) is that you can specify NOCACHE and guarantee order.
