ETA: Clarifying context: By default, BATCH_STEP_EXECUTION.EXIT_MESSAGE (populated with the error message of a failed step) is defined as VARCHAR2(2500).
When a step fails, the error message is typically a stack trace on the order of 10k - 15k characters. The first 2500 characters rarely give insight into the problem. Two questions:
1) Can I safely change the column type from VARCHAR2(2500) to VARCHAR2(4000)? Or better still, CLOB?
2) Do I need to make any changes in Spring Batch to say, "It's okay to send an exit_message of 4000 characters, or unlimited with CLOB, rather than cutting it off at 2500 characters"?
@KevinKirkpatrick,
I believe you can increase the EXIT_MESSAGE length to varchar2(4000) or change the datatype to a CLOB. Both of these options seem to work based on my testing.
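As a minimal sketch, the widening can be done in place on Oracle (note that changing VARCHAR2 to CLOB usually cannot be done with a simple MODIFY, so test that path separately):
ALTER TABLE BATCH_STEP_EXECUTION MODIFY (EXIT_MESSAGE VARCHAR2(4000));
-- BATCH_JOB_EXECUTION has the same EXIT_MESSAGE column in the standard schema,
-- so you would likely widen it as well
ALTER TABLE BATCH_JOB_EXECUTION MODIFY (EXIT_MESSAGE VARCHAR2(4000));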
The jobRepository needs to be updated to include the max-varchar-length attribute.
For example:
<batch:job-repository id="j1" max-varchar-length="100000"/>
I don't know if this is a good solution though. Changing the datatype on a Spring Batch table seems less than ideal.
I cannot figure out how to increase the maximum number of entries per query. I would like to insert a thousand entries per query, but the default limit is 100.
According to the docs, the max_partitions_per_insert_block setting defines the limit on simultaneous entries.
I've tried to modify it from the ClickHouse client, but my insertion still fails:
$ clickhouse-client
my-virtual-machine :) set max_partitions_per_insert_block=1000
SET max_partitions_per_insert_block = 1000
Ok.
0 rows in set. Elapsed: 0.001 sec.
Moreover, there is no max_partitions_per_insert_block field in the /etc/clickhouse-server/config.xml file.
After modifying max_partitions_per_insert_block, I've tried to insert my data, but I'm stuck with this error:
infi.clickhouse_orm.database.ServerError: Code: 252, e.displayText() = DB::Exception: Too many partitions for single INSERT block (more than 100). The limit is controlled by 'max_partitions_per_insert_block' setting. Large number of partitions is a common misconception. It will lead to severe negative performance impact, including slow server startup, slow INSERT queries and slow SELECT queries. Recommended total number of partitions for a table is under 1000..10000. Please note, that partitioning is not intended to speed up SELECT queries (ORDER BY key is sufficient to make range queries fast). Partitions are intended for data manipulation (DROP PARTITION, etc). (version 19.5.3.8 (official build))
EDIT: I'm still stuck with this. I cannot even manually set the parameter to the value I want with SET max_partitions_per_insert_block = 1000: the value is changed but goes back to 100 after exiting and reopening clickhouse-client (even with sudo, so it does not look like a permission problem).
I figured it out when re-reading the documentation, especially this document. I recognized, in the settings profiles described there, the settings I had seen in the system.settings table. I simply added the following to my default profile, reloaded, and my insert of a thousand entries went well: <max_partitions_per_insert_block>1000</max_partitions_per_insert_block>
I guess it was obvious to some, but probably not to inexperienced people.
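For reference, a sketch of where that element might live on a stock install (assuming the default users.xml layout of this 19.x build; adjust to your own setup):
<!-- /etc/clickhouse-server/users.xml -->
<yandex>
    <profiles>
        <default>
            <!-- raise the per-INSERT partition limit for the default profile -->
            <max_partitions_per_insert_block>1000</max_partitions_per_insert_block>
        </default>
    </profiles>
</yandex>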
Most likely you should change the partitioning scheme. Each partition generates several files on the file system, which can put a heavy load on the OS. In addition, this may be the cause of long merges.
Any suggestions on packages (or methodologies) that might help with this? I need to take a ~40MB file we receive weekly and determine what changed from the previous file to the current one. Whatever those changes are, they then need to be made to a single simple database table. In a previous life I accomplished something similar via Linux "diff" with the -Hae parameters, resulting in an "ed script". The contents were then handled by a Perl program, using Tie::File to reference the changed record in the previous file. In an effort to strengthen my Go skills I'm trying to use it for this current task. https://github.com/sergi/go-diff looks like it might be the ticket, but I'm not sure the "patch" output will quite do what I need (easily).
Fixed-width and/or delimited text files are still commonly used; does anyone have any samples, pointers, or suggestions on packages that might help in dealing with them in this way?
Are you sure you need the intermediate step? 40 MB is not very much, and your database engine is specialized in handling data like that.
For instance, with PostgreSQL, just load the new data into a temporary table:
create table temptable (
a varchar,
b varchar,
c varchar
);
copy temptable from '/path/to/csv/newdata.txt' delimiter ',' csv;
Then you can use a simple SQL query to get the lines that do not have an exact match in the old data, for example:
select *
from temptable t
where not exists (
select 1
from oldtable o
where t.a=o.a and t.b=o.b and t.c=o.c
)
If you did not save the data from the previous week's batch, then just remember to copy it to another table for storage. Now the real question is what you want to do with the information, but you should be able to handle most scenarios.
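For completeness, a sketch of the other direction and of keeping this week's data for next week's run (assuming the same three columns and that oldtable holds the previous load):
-- rows that were in last week's file but are gone from the new one
select *
from oldtable o
where not exists (
    select 1
    from temptable t
    where t.a=o.a and t.b=o.b and t.c=o.c
);

-- make this week's data the baseline for next week's comparison
truncate table oldtable;
insert into oldtable select * from temptable;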
Oracle sequence values can be cached on the database side through use of the 'Cache' option when creating sequences. For example
CREATE SEQUENCE sequence_name
CACHE 1000;
will cache up to 1000 values for performance.
My question is whether these values can be cached in the Oracle drivers.
In my application I want to pull back a range of sequence values but don't want to have to go back to the database for each new value. I know Hibernate has similar functionality but I've been unable to find out exactly how it's accomplished.
Any advice would be greatly appreciated.
Thanks,
Matt
No, you cannot reserve a batch of numbers in one session (if I understood correctly). Setting a suitable CACHE value would very likely make this acceptable from a performance perspective.
If you still insist, you can create similar functionality yourself, to be able to reserve a range of numbers at once.
As mentioned by igr, it seems that the Oracle drivers cannot cache sequence values in the Java layer.
However, I have got round this by setting a large increment on the sequence and generating key values myself. The increment for a sequence can be set as follows:
CREATE SEQUENCE sequence_name
INCREMENT BY $increment;
In my application, every time sequence.nextval is executed I assume that the previous $increment values are reserved, and can be used as unique key values. This means that the database is hit once for every $increment key values that are generated.
So let's say, for example, that $increment=5000 and we have a starting sequence value of 1. When sequence.nextval is run for the first time, the sequence value is incremented to 5001. I then assume that the values 2..5001 are reserved. Then the 5000 values are used in the application (in my use case, as table primary keys); as soon as they are all used up, sequence.nextval is run again to reserve another 5000 values, and the process is repeated.
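A minimal sketch of the allocation step, assuming $increment = 5000 as above (the arithmetic mirrors the example):
-- one round trip reserves a block of 5000 key values
SELECT sequence_name.NEXTVAL AS block_end FROM dual;
-- the application hands out (block_end - 4999) .. block_end as primary keys
-- and only calls NEXTVAL again once that block is exhausted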
The only real downside I can see to this approach is that there is a tiny risk of someone running DDL to modify the $increment between the time nextval is run and the time the $increment is used to generate values. Given this is very unlikely, and in my case the DDL will not be updated at runtime, the solution was acceptable.
I realise this doesn't directly answer the question I posed but hopefully it'll be useful to someone else.
Thanks,
Matt.
I have a query that retrieves a large amount of data.
<cfsetting requesttimeout="9999999" >
<cfquery name="randomething" datasource="ds" timeout="9999999" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!---should be about 5 million rows --->
I can successfully retrieve the data with Python's cx_Oracle, and using sys.getsizeof on the Python list returns 22621060, so about 21 megabytes.
ColdFusion does not return an error on the page, and I can't find anything in any of the logs. Why is cfdump not showing the number of rows?
Additional Information
The reason for doing it this way is that I have about 8000 smaller queries to run against the randomething query. In other words, when I run those 8000 queries against the database it takes hours for that process to complete. I suspect this is because I am competing with several other database users, and the database is getting bogged down.
The 8000 smaller queries are getting counts of col1 over a period of col2.
SELECT
count(col1) as count
FROM
table
WHERE
col2 < 20121109
AND
col2 > 20121108
Following Adam Cameron's suggestions:
cflog is suggesting that the query isn't finishing.
I tried changing the query's timeout both in the code and in the CFIDE administrator; apparently CF9 no longer respects the timeout attribute, and regardless of what I tried I couldn't get the query to time out.
I also started playing around with the maxrows attribute to see if I could discern any information that way.
When maxrows is set to 1300000, everything works fine.
When maxrows is 1400000 or greater, I get this error.
When maxrows is 2000000, I observe my original problem.
Update
So this isn't a limit of cfquery. By using QueryNew and then looping over it to add data, I can get well past the 2 million mark without any problems.
I also created a ThinClient datasource using the information in this question, but I didn't observe any change in behavior.
The messages on the database end are
SQL*Net message from client
and
SQL*Net more data to client
I just discovered that by using the thin client along with blockfactor="100" I can retrieve more rows (approx. 3000000).
Is there anything logged on the DB end of things?
I wonder if the timeout is not being respected, and JDBC is "hanging up" on the DB whilst it's working. That's a wild guess. What if you set a very low timeout - eg: 5sec - does it error after 5sec, or what?
The browser could be timing out too. What say you write something to a log before and after the <cfquery> block, with <cflog>, to see if the query is eventually finishing.
I have to wonder what it is you intend to do with these 22M records once you get them back to CF. Whatever it is, it sounds to me like CF is the wrong place to be doing whatever it is: CF ain't for heavy data processing, it's for making web pages. If you need to process 22M records, I suspect you should be doing it on the database. That said, I'm second-guessing what you're doing with no info to go on, so I presume there's probably a good reason to be doing it.
Have you tried wrapping your cfquery within cftry tags to see if that reports anything?
<cfsetting requesttimeout="600" >
<cftry>
<cfquery name="randomething" datasource="ds" timeout="590" >
SELECT
col1,
col2
FROM
table
</cfquery>
<cfdump var="#randomething.recordCount#" /> <!--- should be about 5 million rows --->
<cfcatch type="any">
<cfdump var="#cfcatch#">
</cfcatch>
</cftry>
This is just an idea, but you could give it a go:
You mention that using QueryNew you can successfully add the more-than-two-million records you need.
Also that when your maxRows is less than 1,300,000 things work as expected.
So why not first do a query to count(*) the total number of records in the table, divide by a million and round up, then cfloop over that number, executing a query with maxRows=1000000 and startRow=(((i - 1) * 1000000) + 1) on each iteration...
ArrayAppend each query from within the loop to an array; then, when it's all done, loop over your array, pushing the records into a new Query object. That way you end up with a query at the end containing all the records you were trying to retrieve.
You might hit memory issues, and it will not perform all that well, but hey - this is ColdFusion, those are par for the course, and sometimes crazy things happen / work.
(You could always append the results of each query to the one you're building up from QueryNew as you go rather than pushing each query onto an array, but it'll be easier to debug and see how far you get if it doesn't work if you build an array as you go.)
(Also, using multiple queries within the size that CF can handle, you may then be able to execute the process you need by looping over the array and then over each query, rather than building up one massive query - this would save processing time and memory, but it depends on whether you need the full result set in a single Query object or not.)
If your date ranges are consistent, I would suggest some aggregate functions in SQL instead of having CF process it. Something like:
select col1, count(col1), year(col2), month(col2)
from table
group by col1, year(col2), month(col2)
order by year(col2), month(col2)
Add day() if you need that detail level, too. You can get really creative with date parts.
This should greatly speed up the entire run time and reduce the main query size.
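As a sketch of the same idea adapted to the data in the question, assuming col2 is stored as a YYYYMMDD number (as the sample WHERE clause suggests) and the database is Oracle:
select trunc(col2 / 100) as yyyymm, count(col1) as cnt
from table
group by trunc(col2 / 100)
order by yyyymm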
Your problem here is that ColdFusion cannot time out SQL. This has always been an issue since CF6, I believe. So basically what is happening is that the cfquery is running past the request timeout, but CF cannot time out JDBC, so it waits until the query returns, then tries to run cfdump (which internally uses cfoutput), and this is reported as timing out because the request is now considered to have run too long.
As Adam pointed out, whatever you are trying to do is too large for CF to realistically handle and will either need to be chopped up into smaller jobs or entirely handled in the DB.
So as it turns out, the server was running out of memory; apparently a cfquery result takes up quite a bit more memory than a Python list.
It was Barry's comment that got me going in the right direction; I didn't know much about the server monitor up until this point, other than the fact that it existed.
As it turns out, I am also not very good at reading; the errors that were getting logged in the application.log file were:
GC overhead limit exceeded The specific sequence of files included or processed is: \path\to\index.cfm, line: 10
and
Java heap space The specific sequence of files included or processed is: \path\to\index.cfm
I'll end up going with Adam's suggestion and letting the database do the processing. At least now I'll be able to explain why things are slow instead of just saying, "I don't know".
I am trying to understand the performance of a query that I've written in Oracle. At this time I only have access to SQLDeveloper and its execution timer. I can run SHOW PLAN but cannot use the auto trace function.
The query that I've written runs in about 1.8 seconds when I press "execute query" (F9) in SQLDeveloper. I know that this is only fetching the first fifty rows by default, but can I at least be certain that the 1.8 seconds encompasses the total execution time plus the time to deliver the first 50 rows to my client?
When I wrap this query in a stored procedure (returning the results via an OUT REF CURSOR) and try to use it from an external application (SQL Server Reporting Services), the query takes over one minute to run. I get similar performance when I press "run script" (F5) in SQLDeveloper. It seems that the difference here is that in these two scenarios, Oracle has to transmit all of the rows back rather than just the first 50. This leads me to believe that there are some network connectivity issues between the client PC and the Oracle instance.
My query only returns about 8000 rows so this performance is surprising. To try to prove my theory above about the latency, I ran some code like this in SQLDeveloper:
declare
tmp sys_refcursor;
begin
my_proc(null, null, null, tmp);
end;
...And this runs in about two seconds. Again, does SQLDeveloper's execution clock accurately indicate the execution time of the query? Or am I missing something and is it possible that it is in fact my query which needs tuning?
Can anybody please offer me any insight on this based on the limited tools I have available? Or should I try to involve the DBA to do some further analysis?
"I know that this is only fetching the
first fifty rows by default, but can I
at least be certain that the 1.8
seconds encompasses the total
execution time plus the time to
deliver the first 50 rows to my
client?"
No, it is the time to return the first 50 rows. It doesn't necessarily require that the database has determined the entire result set.
Think about the table as an encyclopedia. If you want a list of animals with names beginning with 'A' or 'Z', you'll probably get Aardvarks and Alligators pretty quickly. It will take much longer to get Zebras as you'd have to read the entire book. If your query is doing a full table scan, it won't complete until it has read the entire table (or book), even if there is nothing to be picked up in anything after the first chapter (because it doesn't know there isn't anything important in there until it has read it).
declare
tmp sys_refcursor;
begin
my_proc(null, null, null, tmp);
end;
This piece of code does nothing. More specifically, it will parse the query to determine that the necessary tables, columns and privileges are in place. It will not actually execute the query or determine whether any rows meet the filter criteria.
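To time the full execution from PL/SQL you would have to fetch the cursor to exhaustion, along the lines of the sketch below (the two VARCHAR2 variables are placeholders; match them to the cursor's actual columns):
declare
    tmp sys_refcursor;
    v1  varchar2(4000); -- placeholder, adjust to the real column types
    v2  varchar2(4000);
    n   pls_integer := 0;
begin
    my_proc(null, null, null, tmp);
    loop
        fetch tmp into v1, v2;
        exit when tmp%notfound;
        n := n + 1;
    end loop;
    close tmp;
    dbms_output.put_line(n || ' rows fetched');
end;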
If the query only returns 8000 rows it is unlikely that the network is a significant problem (unless they are very big rows).
Ask your DBA for a quick tutorial in performance tuning.