How can I limit memory usage when generating a CSV from a large resultset? - oracle

I have a web application in Spring that has a functional requirement for generating a CSV/Excel spreadsheet from a result set coming from a large Oracle database. The expected rows are in the 300,000 - 1,000,000 range. Time to process is not as large of an issue as keeping the application stable -- and right now, very large result sets cause it to run out of memory and crash.
In a normal situation like this, I would use pagination and have the UI display a limited number of results at a time. However, in this case I need to be able to produce the entire set in a single file, no matter how big it might be, for offline use.
I have isolated the issue to the ParameterizedRowMapper being used to convert the result set into objects, which is where I'm stuck.
What techniques might I be able to use to get this operation under control? Is pagination still an option?

A simple answer:
Use a JDBC recordset (or something similar, with an appropriate array/fetch size) and write the data back to a LOB, either temporary or back into the database.
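A minimal sketch of that idea in JDBC (the table, columns, and 500-row fetch size are made up, real code would need proper CSV escaping, and how the temporary CLOB is buffered depends on the driver):

import java.io.Writer;
import java.sql.Clob;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvToClob {

    public Clob exportCsv(Connection conn) throws Exception {
        Clob csv = conn.createClob();                // temporary LOB, not a giant String in heap
        try (Writer out = csv.setCharacterStream(1); // write the CSV from position 1
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name, amount FROM big_table")) {
            ps.setFetchSize(500); // pull rows in batches instead of all at once
            out.write("id,name,amount\n");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // each row is written out and immediately becomes eligible for GC
                    out.write(rs.getLong("id") + "," + rs.getString("name") + ","
                            + rs.getBigDecimal("amount") + "\n");
                }
            }
        }
        return csv; // could be persisted into a table or streamed to the client later
    }
}

The key point is that only one fetch batch of rows is ever held in memory, rather than a million mapped objects.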
Another choice:
Use PL/SQL in the database to write a file with UTL_FILE for your recordset in CSV format. As the file will be on the database server, not on the client, use UTL_SMTP or JavaMail via Java Stored Procedures to mail the file. After all, I'd be surprised if someone was going to watch the hourglass turn over repeatedly while waiting for a 1 million row recordset to be generated.

Instead of loading the entire file into memory, you can process each row individually and use an output stream to send the output directly to the web browser. E.g. in the Servlet API, you can get the output stream from ServletResponse.getOutputStream() and then simply write the resulting CSV lines to that stream.
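A rough sketch of that streaming approach with a plain servlet (the DataSource, query, and column names are assumptions, and real code would need proper CSV escaping):

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

public class CsvExportServlet extends HttpServlet {

    private DataSource dataSource; // assumed to be injected/configured elsewhere

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setContentType("text/csv");
        resp.setHeader("Content-Disposition", "attachment; filename=\"export.csv\"");

        // Wrap the servlet output stream so rows go straight to the client
        Writer out = new OutputStreamWriter(resp.getOutputStream(), StandardCharsets.UTF_8);
        out.write("id,name\n");

        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM big_table")) {
            ps.setFetchSize(500); // keep only a small window of rows in memory
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // hypothetical row formatting; escape commas/quotes in real data
                    out.write(rs.getLong("id") + "," + rs.getString("name") + "\n");
                }
            }
        } catch (SQLException e) {
            throw new ServletException(e);
        }
        out.flush();
    }
}

Because nothing is accumulated, memory use stays flat regardless of how many rows the query returns.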

I would push back on those requirements; they sound pretty artificial.
What happens if your application fails, or the power goes out before the user looks at that data?
From your comment above, it sounds like you know the answer: you need filesystem or Oracle access in order to do your job.
You are being asked to generate some data, something that is not reproducible by SQL?
If it were repeatable, you would just send pages of data back to the user at a time.
Since this report, I'm guessing, has something to do with the current state of your data, you need to store that result somewhere if you can't stream it out to the user. I'd write a stored procedure in Oracle; it's much faster not to send data back and forth across the network. If you have special tools or it's just easier, it sounds like there's nothing wrong with doing it on the Java side instead.
Can you schedule this report to run once a week?

Have you considered the performance of an Excel spreadsheet with 1,000,000 rows?

Related

Serializing query result

I have a financial system with all its business logic located in the database, and I have to code an automated workflow for transaction batch processing, which consists of the steps listed below:
A user or an external system inserts some data in a table
Before further processing, a snapshot of this data in the form of a CSV file with a digital signature has to be made. The CSV snapshot itself and its signature have to be saved in the same input table. The program then updates the successfully signed rows to make them available for further steps of the code
...further steps of code
The obvious trouble is step #2: I don't know how to assign the results of a query, as a BLOB representing a CSV file, to a variable. It seems like basic stuff, but I couldn't find it. The CSV format was chosen by the users because it is human-readable. The signing itself can be done with a request to an external system, so that's not an issue.
Restrictions:
there is no application server that could process the data, so I have to do it with PL/SQL
there is no way to save a local file, everything must be done on the fly
I know that normally one would do all the work on the application layer or with some local files, but unfortunately this is not the case.
Any help would be highly appreciated, thanks in advance
I agree with #william-robertson. You just need to create a comma-delimited values string (assuming a header row and data rows) and write it to a CLOB. I recommend an "insert" trigger (there are lots of SQL tricks you can use to make that easier). Usage of that CSV string will need to be owned by the part of the application that reads it in and needs to do something with it.
I understand you stated you need to create a CSV, but see if you could do XML instead. Then you could use DBMS_XMLGEN to generate the necessary snapshot into a database column directly from the query for it.
I do not accept the notion that a CSV is human-readable (actually try reading one sometime as straight text). What is valid is that Excel displays it in human-readable form. But it should also be able to display the XML in human-readable form. Further, if needed, the data in it can be ported directly back into the original columns.
Just an alternative idea.

Storing user input (Visual Basic)

I'm creating an application that will take a number of user inputs, store the data for a while, and eventually (at the end of the day) export it to an excel file.
An example might be that a user would input what they did throughout the day. Breakfast/At Home/for 10 minutes. Then later on they would input Coding/At Work/8 hours. Then later on Commuting/Subway/15 minutes. Etc.
I can handle the user interface, and the exporting to excel.
I'm just wondering what might be the best way to store that data and display it back to the user while the program is running. I'm used to working with macros in Excel itself, where I could simply store each row of data in another row on the excel spreadsheet itself.
I would still like a spreadsheet-like display, so that the user can go in to each data point and correct any mistakes. But I am making this as a standalone application using visual basic. Fortunately, I think the ListView or DataGridView tools will let me do this.
At the moment the method I'm thinking of using is simply to store all the user inputs in an array. But I would have to ReDim the array and increase its size each time the user created a new entry.
I can already see a problem with this, however, and that is that an array would have to be constantly stored in active memory. If the user's computer were to crash then all the data would be lost for good.
I'm really a rookie here, so I could use some guidance on how to store a bunch of user inputs like this.
You can use a database file. A local SQL Server Compact Edition database (a single file) will store your data. You can use Entity Framework to interact with this database.
If you want to use Code First (generate your database from your code) use this:
https://www.codeproject.com/Articles/680116/Code-First-with-SQL-CE
If you want to use Database First (generate your entities from your database) use this:
http://erikej.blogspot.com/2013/11/entity-framework-6-sql-server-compact-4_25.html
You can also use SQLite or another kind of database file, but I like SQL Server CE.

What is the capacity of a BluePrism Internal Work Queue?

I am working in BluePrism Robotic Process Automation and trying to load an Excel sheet with more than 100k records (it might go upwards of 300k in some cases).
I am trying to load the internal work queue of BluePrism, but I get the error quoted below:
'Load Data Into Queue' ERROR: Internal : Exception of type 'System.OutOfMemoryException' was thrown.
Is there a way to avoid this problem, in the way where I can free up more memory?
I plan to process records one by one from queue, and put them into new excel sheets categorically. Loading all that data in a collection and looping over it may be memory consuming, so I am trying to find out a more efficient way.
I welcome any and all help/tips.
Thanks!
Basic Solution:
Break up the number of Excel rows you are pulling into your Collection data item at any one time. The thresholds for this will depend on your resource's system memory and architecture, as well as the structure and size of the data in the Excel Worksheet. I've been able to move 50k 10-column rows from Excel to a Collection and then into the Blue Prism queue very quickly.
You can set this up by specifying the Excel Worksheet range to pull into the Collection data item, and then shift that range each time the Collection has been successfully added to the queue.
After each successful addition to the queue and/or before you shift the range and/or at a predefined count limit you can then run a Clean Up or Garbage Collection action to free up memory.
You can do all of this with the provided Excel VBO and an additional Clean Up object.
Keep in mind:
Even breaking it up, looping over a Collection this large to amend the data will be extremely expensive and slow. The most efficient way to make changes to the data will be at the Excel Workbook level or when it is already in the Blue Prism queue.
Best Bet: esqew's alternative solution is the most elegant and probably your best bet.
Jarrick hit it on the nose in that Work Queue items should provide the bot with information on what they are to be working on and a Control Room feedback space, but not the actual work data to be implemented/manipulated.
In this case you would want to just use the items Worksheet row number and/or some unique identifier from a single Worksheet column as the queue item data so that the bot can provide Control Room feedback on the status of the item. If this information is predictable enough in format there should be no need to move any data from the Excel Worksheet to a Collection and then into a Work Queue, but rather simply build the queue based on that data predictability.
Conversely you can also have the bot build the queue "as it happens", in that once it grabs the single row data from the Excel Worksheet to work it, can as well add a queue item with the row number of the data. This will then enable Control Room feedback and tracking. However, this would, in almost every case, be a bad practice as it would not prevent a row from being worked multiple times unless the bot checked the queue first, at which point you've negated the speed gains you were looking to achieve in cutting out the initial queue building in the first place. It would also be impossible to scale the process for multiple bots to work the Excel Worksheet data efficiently.
This is a common issue for RPA, especially when working with large Excel files. As far as I know, there are no 100% solutions, only methods that reduce the symptoms. I have run into this problem several times and these are the ways I would try to handle it:
Set stage logging to Disabled or Errors Only.
Don't log parameters on action stages (especially ones that work with the Excel files).
Run a Garbage Collection process.
See if it is possible to avoid reading Excel files into BP collections and use OLEDB to query the file instead.
See if it is possible to increase the RAM on the machines.
If they’re using the 32-bit version of the app, then it doesn’t really matter how much memory you feed it, Blue Prism will cap out at 2 GB.
This may be because of the BP Server, as memory is shared between Processes and the Work Queue. A better option is to use two bots and multiple queues to avoid the memory error.
If you're using Excel documents or CSV files, you can use the OLEDB object to connect and query against it as if it were a database. You can use the SQL syntax to limit the amount of rows that are returned at a time and paginate through them until you've reached the end of the document.
For starters, you are making incorrect use of the Work Queue in Blue Prism. The Work Queue should not be used to store this type and amount of data. (please read the BP documentation on Work Queues thoroughly).
Solving the issue at hand, being the misuse, requires two changes:
Only store references in your Item Data which point to the Excel file containing the data.
If you're consulting this much data many times, perhaps convert the file into a CSV and write a VBO that queries the data directly in the CSV.
The first change is not just a recommendation, but as your project progresses and IT Architecture and InfoSec comes into play, it will be mandatory.
As for the CSV VBO, take a look at C#; it will make your life a lot easier than loading all this data into BP (time-consuming, unreliable, ...).

Pentaho Spoon Microsoft Access output performance

I'm trying to populate a database table (Access) with the "Microsoft Access Output" step, but I get very bad performance. My data was read from 2 XMLs and merged in 1 minute (36,000 rows of data), while the Access output has now been running for an hour at 12 r/s. I set the commit size high enough to commit all my data at once (with a commit size of 500 I got a Java error after the 10th commit).
If I write my data to a CSV and import it into Access, that is much quicker, but I want to automate as much of the process as possible.
Any suggestion is welcome on how to get better performance out of this.
I've never had good luck with the MS-Access output step. It seems to work much better if you create an ODBC entry for your Access DB and load it with a regular Table Output step.
Also you might check that you have the most current version of the Jackcess driver.
Edit: also, for automating flows of data, like writing a CSV and then loading it, you would use a Job. If you're only familiar with transforms, look at the docs on ETL jobs.

Exporting 8 million records from Oracle to MongoDB

Now I have an Oracle database with 8 million records and I need to move them to MongoDB.
I know how to import some data into MongoDB from a JSON file using the import command, but I want to know whether there is a better way to achieve this with regard to the issues below.
Due to the limit on execution time, how do I handle it?
The database is updated every second, so what's the plan to make sure that every record has been moved?
Due to the limit on execution time, how do I handle it?
Don't do it with the JSON export / import. Instead you should write a script that reads the data, transforms it into the correct format for MongoDB, and then inserts it there (a rough sketch follows the list of reasons below).
There are a few reasons for this:
Your tables / collections will not be organized the same way. (If they are, then why are you using MongoDB?)
This will allow you to monitor progress of the operation. In particular you can output to log files every 1000th entry or so to get some progress and be able to recover from failures.
This will test your new MongoDB code.
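For illustration only, a minimal sketch of such a script using JDBC and the MongoDB Java sync driver; the connection strings, the ORDERS table, and the document shape are all invented:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class OracleToMongo {

    public static void main(String[] args) throws Exception {
        try (Connection oracle = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//db-host:1521/ORCL", "user", "password");
             MongoClient mongo = MongoClients.create("mongodb://mongo-host:27017")) {

            MongoCollection<Document> orders =
                    mongo.getDatabase("shop").getCollection("orders");

            try (PreparedStatement ps = oracle.prepareStatement(
                    "SELECT order_id, customer_name, total FROM orders")) {
                ps.setFetchSize(1000); // stream rows instead of loading all 8M at once

                List<Document> batch = new ArrayList<>();
                long count = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // reshape the relational row into the target document model
                        batch.add(new Document("orderId", rs.getLong("order_id"))
                                .append("customer", rs.getString("customer_name"))
                                .append("total", rs.getDouble("total")));
                        if (batch.size() == 1000) {
                            orders.insertMany(batch); // insert in batches, not row by row
                            batch.clear();
                        }
                        if (++count % 100_000 == 0) {
                            System.out.println(count + " rows migrated"); // progress log
                        }
                    }
                }
                if (!batch.isEmpty()) {
                    orders.insertMany(batch);
                }
            }
        }
    }
}

The periodic progress line is where you would also record the last migrated key, so a failed run can resume instead of starting over.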
The database is updated every second, so what's the plan to make sure that every record has been moved?
There are two strategies here.
Track the entries that are updated and re-run your script on newly updated records until you are caught up.
Write to both databases while you run the script to copy data. Then once the script is done and everything is up to date, you can cut over to just using MongoDB.
I personally suggest #2; it is the easiest method to manage and test while maintaining up-time. It's still going to be a lot of work, but this will allow the transition to happen.
