Need to commit after each 500 rows deleted - oracle

I am invoking delete queries in a loop using JDBC. The number of records deleted by a particular delete query is not consistent; it can be 40, 80, 100, etc. My scenario is that I need to commit after every 500 records deleted.
The way I implemented it is: I accumulate the count of the records that are going to be deleted by each delete query until the count is equal to or more than 500. As soon as the count reaches 500 or more, I pause the accumulation, invoke the delete query, and commit, so the ~500 records are deleted in one shot.
Is there any other better approach or JDBC standard way to do this?

You can use two JDBC approaches for deleting: run a statement like delete ... where PK in (...) and pass up to 500 primary keys at once, or use JDBC batching to package a series of SQL statements and execute them together.
A batch is just a way to package several SQL statements into a single "conversation" with the database. The performance gain is mostly network: instead of sending hundreds of small packets, you send a few big ones. Parsing and processing on the DB side is the same.
But if you run a more complex query and need to track the number of changed rows, there is no special JDBC support for that. You have to accumulate the count yourself and then commit manually once it reaches your threshold.
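For the simple case, the accumulate-and-commit logic is only a few lines. A minimal JDBC sketch, assuming an Oracle DataSource and made-up table/column names (child_table, parent_id, dataSource, parentIds are all placeholders):
// executeUpdate reports how many rows each statement deleted; accumulate that
// count and commit whenever the running total reaches 500.
try (Connection con = dataSource.getConnection();
     PreparedStatement ps = con.prepareStatement(
             "DELETE FROM child_table WHERE parent_id = ?")) {
    con.setAutoCommit(false);
    int pendingRows = 0;
    for (long parentId : parentIds) {
        ps.setLong(1, parentId);
        pendingRows += ps.executeUpdate();
        if (pendingRows >= 500) {
            con.commit();          // commit roughly every 500 deleted rows
            pendingRows = 0;
        }
    }
    con.commit();                  // commit whatever is left over
}
If you switch to addBatch()/executeBatch() to cut down on round trips, be aware that some drivers (Oracle's included) may report Statement.SUCCESS_NO_INFO instead of real per-statement row counts for batched statements, in which case you still have to obtain the counts another way.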

Related

Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is an address table in Oracle that has relationships with multiple tables such as subscriber, member, etc.
The current design is such that any change in the associated tables increments the record version throughout all tables.
So a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove the duplicate records, and update the foreign keys in the associated tables, while making sure it doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic in which a unique hash is generated for every address. If the calculated hash is already present, the address is a duplicate; we then merge it into a single address record and update the foreign keys in the associated tables.
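For illustration, the hashing step might look roughly like this in application code (the column names and normalisation rules are assumptions, not the actual script):
// Hypothetical normalise-and-hash helper; the real script may hash other fields.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

static String addressHash(String line1, String city, String zip) throws Exception {
    String normalized = (line1 + "|" + city + "|" + zip)
            .toLowerCase()
            .replaceAll("\\s+", " ")
            .trim();
    byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(normalized.getBytes(StandardCharsets.UTF_8));
    return HexFormat.of().formatHex(digest);  // HexFormat needs Java 17+
}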
But the problem is that there are around 300 billion records in the address table, so this cleanup process is taking a lot of time and will take several days to complete.
We have tried adding an index on the hash column, but the process is still slow.
We have also updated the insertion/query logic to use addresses in the new structure (using the hash, and without the version) in order to handle incoming requests in production.
We are planning to do the processing in chunks, but it will be a very long, ongoing activity.
Questions:
Would like to know if any further improvement can be made to the above approach.
Will distributed processing help here? (maybe using Hadoop, Spark/Hive/MR, etc.)
Is there some sort of tool that can be used here?
Suggestion 1
Use built-in delete parallel
delete /*+ parallel(t 8) */ mytable t where ...
Suggestion 2
Use distributed processing (Hadoop, Spark/Hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset, e.g. (a JDBC sketch of the same chunking idea follows the list):
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
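A rough JDBC sketch of that chunking idea, with the DataSource, the id boundaries, and the is_duplicate predicate all assumed for illustration:
import java.sql.*;
import java.util.concurrent.*;
import javax.sql.DataSource;

// Each worker deletes its own id range so workers do not contend on the same blocks.
static void deleteInChunks(DataSource ds, long minId, long maxId) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    long chunk = 1_000_000L;
    for (long lo = minId; lo <= maxId; lo += chunk) {
        final long from = lo, to = lo + chunk - 1;
        pool.submit(() -> {
            try (Connection con = ds.getConnection();
                 PreparedStatement ps = con.prepareStatement(
                         "DELETE FROM mytable WHERE id BETWEEN ? AND ? AND is_duplicate = 'Y'")) {
                con.setAutoCommit(false);
                ps.setLong(1, from);
                ps.setLong(2, to);
                ps.executeUpdate();
                con.commit();            // one commit per chunk keeps each transaction small
            } catch (SQLException e) {
                // log and decide whether to retry or skip this chunk
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
}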
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way is to create an empty table, copy all the required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and it greatly depends on the number of indexes: the more you have, the longer it will take.
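A placeholder sketch of that copy-and-swap sequence, run through JDBC here only to stay consistent with the rest of the thread (the table names, the keep_flag predicate, and the NOLOGGING/PARALLEL attributes are all illustrative):
// Requires downtime: nothing should write to mytable while this runs.
try (Connection con = dataSource.getConnection();
     Statement st = con.createStatement()) {
    st.execute("CREATE TABLE mytable_keep NOLOGGING PARALLEL 8 AS " +
               "SELECT * FROM mytable WHERE keep_flag = 'Y'");
    st.execute("DROP TABLE mytable PURGE");
    st.execute("ALTER TABLE mytable_keep RENAME TO mytable");
    // then recreate all indexes, constraints and grants on the new table
}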
P.S. There are no "magic" tools to do it. In the end they all run the same SQL commands that you can run yourself.
It's possible to use the Oracle MERGE statement to insert the data if you use plain SQL.

What will happen when inserting a row during a long running query

I am writing some data loading code that pulls data from a large, slow table in an oracle database. I have read-only access to the data, and do not have the ability to change indexes or affect the speed of the query in any way.
My select statement takes 5 minutes to execute and returns around 300,000 rows. The system is inserting large batches of new records constantly, and I need to make sure I get every last one, so I need to save a timestamp for the last time I downloaded the data.
My question is: If my select statement is running for 5 minutes, and new rows get inserted while the select is running, will I receive the new rows or not in the query result?
My gut tells me that the answer is 'no', especially since a large portion of those 5 minutes is just the time spent on the data transfer from the database to the local environment, but I can't find any direct documentation on the scenario.
"If my select statement is running for 5 minutes, and new rows get inserted while the select is running, will I receive the new rows or not in the query result?"
No. Oracle enforces strict isolation levels and does not permit dirty reads.
The default isolation level is Read Committed. This means the result set you get after five minutes will be identical to the one you would have got if Oracle could have delivered all the records in 0.0000001 seconds. Anything committed after your query started running will not be included in the results. That includes updates to existing records as well as inserts.
Oracle does this by tracking changes to the table in the UNDO tablespace. Provided it can reconstruct the original image from that data, your query will run to completion; if for any reason the undo information has been overwritten, your query will fail with the dreaded ORA-1555: Snapshot too old. That's right: Oracle would rather hurl an exception than provide us with an inconsistent result set.
Note that this consistency applies at the statement level. If we run the same query twice within the one transaction we may see two different result sets. If that is a problem (I think not in your case) we need to switch from Read Committed to Serializable isolation.
The Concepts Manual covers Concurrency and Consistency in great depth. Find out more.
So to answer your question, take the timestamp from the time you start the select. Specifically, take the max(created_ts) from the table before you kick off the query. This should protect you from the gap Alex mentions (if records are not committed the moment they are inserted, there is the potential to lose records if you base the select on a comparison with the system timestamp). Although doing this means you're issuing two queries in the same transaction, which means you do need Serializable isolation after all!
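A sketch of that suggestion in JDBC, with made-up table/column names (big_table, created_ts, dataSource, lastDownloadedUpTo): both the max(created_ts) lookup and the main select run inside one serializable transaction, so they see the same snapshot.
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);
    con.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);

    Timestamp highWater;
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT MAX(created_ts) FROM big_table")) {
        rs.next();
        highWater = rs.getTimestamp(1);
    }

    try (PreparedStatement ps = con.prepareStatement(
            "SELECT col1, col2 FROM big_table WHERE created_ts > ?")) {
        ps.setTimestamp(1, lastDownloadedUpTo);  // high-water mark saved from the previous run
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process the row
            }
        }
    }

    con.commit();
    // persist highWater as the starting point for the next download
}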

Multiple small inserts in clickhouse

I have an event table (MergeTree) in ClickHouse and want to run a lot of small inserts at the same time. However, the server becomes overloaded and unresponsive. Moreover, some of the inserts are lost. There are a lot of records like this in the ClickHouse error log:
01:43:01.668 [ 16 ] <Error> events (Merger): Part 20161109_20161109_240760_266738_51 intersects previous part
Is there a way to optimize such queries? I know I can use bulk inserts for some types of events - basically, running one insert with many records, which ClickHouse handles pretty well. However, some of the events, such as clicks or opens, cannot be handled in this way.
The other question: why does ClickHouse decide that similar records exist when they don't? There are similar records at the time of insert that have the same fields as in the index, but the other fields are different.
From time to time I also receive the following error:
Caused by: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, message: Connect to localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out, host: localhost, port: 8123; Connect to ip6-localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out
... 36 more
Mostly during the project build, when tests against the ClickHouse database are run.
ClickHouse has a special type of table for this - Buffer. It's stored in memory and allows many small inserts without a problem. We have nearly 200 different inserts per second - it works fine.
Buffer table:
CREATE TABLE logs.log_buffer (rid String, created DateTime, some String, d Date MATERIALIZED toDate(created))
ENGINE = Buffer('logs', 'log_main', 16, 5, 30, 1000, 10000, 1000000, 10000000);
Main table:
CREATE TABLE logs.log_main (rid String, created DateTime, some String, d Date)
ENGINE = MergeTree(d, sipHash128(rid), (created, sipHash128(rid)), 8192);
Details in manual: https://clickhouse.yandex/docs/en/operations/table_engines/buffer/
This is a known issue when processing a large number of small inserts into a (non-replicated) MergeTree.
This is a bug; we need to investigate and fix it.
As a workaround, you should send inserts in larger batches, as recommended: about one batch per second: https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data.
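A sketch of that workaround over the ClickHouse JDBC driver already shown in the question's stack trace; the events table, its columns, and the Event type are made up. Rows are buffered in the application and flushed roughly once per second as a single multi-row INSERT.
try (Connection con = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default");
     PreparedStatement ps = con.prepareStatement(
             "INSERT INTO events (event_time, event_type, payload) VALUES (?, ?, ?)")) {
    long lastFlush = System.currentTimeMillis();
    for (Event e : incomingEvents) {              // hypothetical stream of events
        ps.setTimestamp(1, Timestamp.from(e.time()));
        ps.setString(2, e.type());
        ps.setString(3, e.payload());
        ps.addBatch();
        if (System.currentTimeMillis() - lastFlush >= 1_000) {
            ps.executeBatch();                    // sent to ClickHouse as one INSERT
            lastFlush = System.currentTimeMillis();
        }
    }
    ps.executeBatch();                            // flush whatever is left
}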
I've had a similar problem, although not as bad - making ~20 inserts per second caused the server to reach a high loadavg, memory consumption and CPU usage. I created a Buffer table, which buffers the inserts in memory and flushes them periodically to the "real" on-disk table. And just like magic, everything went quiet: loadavg, memory and CPU usage came down to normal levels. The nice thing is that you can run queries against the Buffer table and get back matching rows from both memory and disk, so clients are unaffected by the buffering. See https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
Alternatively, you can use something like https://github.com/nikepan/clickhouse-bulk: it will buffer multiple inserts and flush them all together according to user policy.
The design of ClickHouse MergeTree engines is not meant to take small writes concurrently. As far as I understand, MergeTree merges the parts of data written to a table based on partitions and then reorganizes the parts for better aggregated reads. If we do small writes often, you will encounter another exception:
Error: 500: Code: 252, e.displayText() = DB::Exception: Too many parts (300). Merges are processing significantly slow
When you try to understand why the above exception is thrown, the idea becomes a lot clearer: CH needs to merge data, and there is an upper limit on how many parts can exist. Every write in a batch is added as a new part and is eventually merged with the partitioned table.
SELECT
table, count() as cnt
FROM system.parts
WHERE database = 'dbname' GROUP BY `table` ORDER BY cnt DESC
The above query can help you monitor parts: observe while writing how the parts increase and eventually merge down.
My best bet for the above would be buffering the data set and periodically flushing it to the DB, but then that means no real-time analytics.
Using a Buffer table is good; however, please consider these points:
If the server is restarted abnormally, the data in the buffer is lost.
FINAL and SAMPLE do not work correctly for Buffer tables. These conditions are passed to the destination table, but are not used for processing data in the buffer.
When adding data to a Buffer, one of the buffers is locked (so no reads).
If the destination table is replicated, some expected characteristics of replicated tables are lost when writing to a Buffer table (no deduplication).
Please read it thoroughly; it's a special-case engine: https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/

Power Query 'an evaluation is in progress' when merging

When merging two tables in Power Query, an evaluation is run to determine the possible number of matches. I work with pretty large tables (merging a 10K-record table with a 500K-record table), so this can take a long time.
I know there will be matches because I have done this before, and I am not a beginner. Yet Power Query insists on running this evaluation.
Is there any way to bypass this step? It almost feels like when you just need to turn automatic calculation off in Excel so that you can get on with actually doing something.
Any ideas?
I would add an upstream filter to limit the rows, e.g. Keep Rows / Keep Top Rows / 100. You may need to do this on both queries. Ideally you keep enough rows, or use a specific filter, to get some matches to help your downstream query design work.
Then once the query design is finished, I would remove the filter(s) and let it rip.
This is what PQ should be doing in the Query Editor, but it does seem to go rogue on Merge in particular.

ColdFusion's cfquery failing silently

I have a query that retrieves a large amount of data.
<cfsetting requesttimeout="9999999" >
<cfquery name="randomething" datasource="ds" timeout="9999999" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!---should be about 5 million rows --->
I can successfully retrieve the data with python's cx_Oracle and using sys.getsizeof on the python list returns 22621060, so about 21 megabytes.
ColdFusion does not return an error on the page, and I can't find anything in any of the logs. Why is cfdump not showing the number of rows?
Additional Information
The reason for doing it this way is that I have about 8000 smaller queries to run against the randomething query. In other words, when I run those 8000 queries against the database it takes hours for that process to complete. I suspect this is because I am competing with several other database users, and the database is getting bogged down.
The 8000 smaller queries are getting counts of col1 over a period of col2.
SELECT
count(col1) as count
FROM
table
WHERE
col2 < 20121109
AND
col2 > 20121108
Following Adam Cameron's suggestions:
cflog suggests that the query isn't finishing.
I tried changing the query's timeout both in the code and in the CFIDE administrator; apparently CF9 no longer respects the timeout attribute, and regardless of what I tried I couldn't get the query to time out.
I also started playing around with the maxrows attribute to see if I could discern any information that way.
when maxrows is set to 1300000 everything works fine
when maxrows is 1400000 or greater I get this error
when maxrows is 2000000 I observe my original problem
Update
So this isn't a limit of cfquery. By using QueryNew and then looping over it to add data, I can get well past the 2 million mark without any problems.
I also created a ThinClient datasource using the information in this question, I didn't observe any change in behavior.
The messages on the database end are
SQL*Net message from client
and
SQL*Net more data to client
I just discovered that by using the thin client along with blockfactor="100" I can retrieve more rows (approx. 3,000,000).
Is there anything logged on the DB end of things?
I wonder if the timeout is not being respected, and JDBC is "hanging up" on the DB whilst it's working. That's a wild guess. What if you set a very low timeout - eg: 5sec - does it error after 5sec, or what?
The browser could be timing out too. What say you write something to a log before and after the <cfquery> block, with <cflog>, to see if the query is eventually finishing?
I have to wonder what it is you intend to do with these 22M records once you get them back to CF. Whatever it is, it sounds to me like CF is the wrong place to be doing whatever it is: CF ain't for heavy data processing, it's for making web pages. If you need to process 22M records, I suspect you should be doing it on the database. That said, I'm second-guessing what you're doing with no info to go on, so I presume there's probably a good reason to be doing it.
Have you tried wrapping your cfquery within cftry tags to see if that reports anything?
<cfsetting requesttimeout="600" >
<cftry>
<cfquery name="randomething" datasource="ds" timeout="590" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!--- should be about 5 million rows --->
<cfcatch type="any">
<cfdump var="#cfcatch#">
</cfcatch>
</cftry>
This is just an idea, but you could give it a go:
You mention that using QueryNew you can successfully add the more-than-two-million records you need.
Also that when your maxRows is less than 1,300,000 things work as expected.
So why not first run a query to count(*) the total number of records in the table, divide by a million and round up, then cfloop over that number, executing a query with maxRows=1000000 and startRow=(((i - 1) * 1000000) + 1) on each iteration...
ArrayAppend each query from within the loop to an array, then when it's all done, loop over your array pushing the records into a new Query object. That way you end up with a single query containing all the records you were trying to retrieve.
You might hit memory issues, and it will not perform all that well, but hey - this is Coldfusion, those are par for the course, and sometimes crazy things happen / work.
(You could always append the results of each query to the one you're building up from QueryNew as you go, rather than pushing each query onto an array, but if it doesn't work it will be easier to debug and see how far you got if you build an array as you go.)
(Also, by using multiple queries each within the size that CF can handle, you may be able to execute the process you need by looping over the array and then over each query, rather than building up one massive query - that would save processing time and memory, but it depends on whether you need the full result set in a single Query object or not.)
If your date ranges are consistent, I would suggest some aggregate functions in SQL instead of having CF process it. Something like:
select count(col1) as cnt, year(col2), month(col2)
from table
group by year(col2), month(col2)
order by year(col2), month(col2)
Add day() if you need that level of detail, too; you can get really creative with date parts.
This should greatly speed up the entire run time and reduce the main query size.
Your problem here is that ColdFusion cannot time out SQL. This has always been an issue, since CF6 I believe. So basically what is happening is that the cfquery is running longer than the request timeout, but CF cannot time out JDBC, so it waits until the query is done and then tries to run cfdump (which internally uses cfoutput), and that is what gets reported as timing out, because the request is now considered to have run too long.
As Adam pointed out, whatever you are trying to do is too large for CF to realistically handle and will either need to be chopped up into smaller jobs or entirely handled in the DB.
So as it turns out the server was running out of memory, apparently cfquery takes up quite a bit more memory than a python list.
It was Barry's comment that got me going in the right direction, I didn't know much about the server monitor up until this point other than the fact that it existed.
As it turns out I am also not very good at reading; the errors that were getting logged in the application.log file were
GC overhead limit exceeded The specific sequence of files included or processed is: \path\to\index.cfm, line: 10
and
Java heap space The specific sequence of files included or processed is: \path\to\index.cfm
I'll end up going with Adam's suggestion and let the database do the processing. At least now I'll be able to explain why things are slow instead of just saying, "I don't know".
