Pentaho Spoon Microsoft Access output performance

I'm trying to populate a database table (Access) with the "Microsoft Access Output" step, but I get very bad performance. My data was read from 2 XMLs and merged in 1 minute (36,000 rows), yet the Access output has now been running for an hour at 12 rows/s. I set the commit size high enough to commit all my data at once (with a commit size of 500 I got a Java error after the 10th commit).
If I write my data to a CSV and import it into Access, that is much quicker, but I want to automate as much of the process as possible.
Any suggestion on how to get better performance out of this is welcome.

I've never had good luck with the MS Access Output step. It seems to work much better if you create an ODBC entry for your Access DB and load it with a regular Table Output step.
Also, you might check that you have the most current version of the Jackcess driver.
Edit: also, for automating flows of data, like writing a CSV and then loading it, you would use a Job. If you're only familiar with transformations, look at the docs on ETL jobs.
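For reference, here is a rough sketch of what the same kind of load looks like over plain JDBC outside of Spoon, assuming the UCanAccess driver (which builds on Jackcess) is on the classpath; the table, columns and file path are made up. The point is the same as tuning the step's commit size: batch the inserts and commit in chunks.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AccessBatchLoad {
    public static void main(String[] args) throws Exception {
        // Database path and table/column names are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:ucanaccess://C:/data/target.accdb")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO my_table (id, name) VALUES (?, ?)")) {
                for (int i = 0; i < 36000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row " + i);
                    ps.addBatch();
                    if (i % 1000 == 0) {          // commit in chunks, like the step's commit size
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();                 // flush the last partial batch
                conn.commit();
            }
        }
    }
}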

Related

Two ETL processes on the same source data at the same time

Sorry for the really newbie question, but I want to get a simple answer.
Our data comes in to us and gets stored as raw data. Currently we have an ICT team who have produced an ETL process that does a lot of work turning it into something usable for analysis, but this takes a lot of time. The issue is that the delay means the data quality report that needs to go back is often late, which prevents issues from being resolved. Since DQ needs to be done on the raw data rather than the cleaned data, can I build a second ETL process to create the DQ tables?
By this I mean, can they both run simultaneously? The main ETL is an Oracle process using the raw data as its data source, and the one I want to build will also use the raw data in Oracle as its data source, using SSIS.
I know it's probably a simple question, but it's not something I do on a day-to-day basis, so I don't know if it can be done this way or if one will have to wait until the other has finished.
We are storing the data in Oracle 11g and doing the major processing there. I'll be using SQL Server 2008 R2 to do what I want.
Thanks for your help

How to implement ORACLE to VERTICA replication?

I am in the process of building an Oracle to Vertica pipeline.
We are looking to create a Vertica DB that will run heavy reports. So far everything is fine: Vertica is fast, space usage is great, and all is well until we get to the main part, getting the data from Oracle to Vertica.
OK, the initial load is fine: dump to CSV from Oracle, load into Vertica, and the load times are so good that everybody thinks it's a bad joke or that there is some magic going on. It is simply fast.
Now the bad part: both databases (ORACLE/VERTICA) are up and running, and data keeps getting altered in ORACLE, so I need to replicate it in VERTICA. What now?
From my tests and from what I understand about Vertica, inserts and updates should not be used beyond maybe 20 per second at most, so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and ETL-ing it to produce CSV data with the new data, altered data and deleted/changed values, and then applying it to VERTICA, but I cannot get a list like this:
Because explicit data changes in VERTICA lead to slow performance.
So I am looking for ideas on how to solve this issue, knowing I cannot:
Alter my ORACLE production structure.
Use ORACLE environment resources for filtering the data.
Use insert, update or delete statements in my VERTICA load process.
Things I depend on:
The use of the COPY command.
Data consistency.
A maximum 60-minute window (every 60 minutes, new/altered data needs to go to VERTICA).
I have seen Continuent data replication, but it seems nobody wants to sell their product; I cannot get in touch with them.
Will loading the whole data set into a new table and then swapping the tables be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old, new, swap rename to swap, old, new;
truncate table new;
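If you want to drive that load-and-swap from code instead of vsql or a shell script, a sketch over JDBC could look like the following, assuming the Vertica JDBC driver and the same three-table rotation as above; host, credentials, table names and file path are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class VerticaLoadAndSwap {
    public static void main(String[] args) throws Exception {
        // Host, credentials, table and file names are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:vertica://vertica-host:5433/warehouse", "dbadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // Bulk-load the fresh Oracle extract into the staging table ("new").
            stmt.execute("COPY new FROM LOCAL '/tmp/oracle_extract.csv' DELIMITER ',' ABORT ON ERROR");

            // Rotate the three tables in one rename, as in the snippet above.
            stmt.execute("ALTER TABLE old, new, swap RENAME TO swap, old, new");

            // Empty what is now the staging table, ready for the next 60-minute cycle.
            stmt.execute("TRUNCATE TABLE new");
        }
    }
}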
Extract the data from Oracle (in CSV format) and load it using the Vertica COPY command. Write a simple shell script to automate this process.
I used to use Talend (ETL), but it was very slow, so I moved to this conventional process and it has really worked for me. Currently processing 18M records, my entire process takes less than 2 minutes.

Exporting 8 million records from Oracle to MongoDB

I have an Oracle database with 8 million records and I need to move them to MongoDB.
I know how to import data into MongoDB from a JSON file using the import command, but I want to know whether there is a better way to achieve this, given these issues:
Due to the limit on execution time, how do I handle it?
The database is growing every second, so what's the plan to make sure that every record gets moved?
Due to the limit on execution time, how do I handle it?
Don't do it with the JSON export/import. Instead, write a script that reads the data, transforms it into the correct format for MongoDB, and then inserts it there.
There are a few reasons for this:
Your tables / collections will not be organized the same way. (If they are, then why are you using MongoDB?)
This will allow you to monitor progress of the operation. In particular you can output to log files every 1000th entry or so to get some progress and be able to recover from failures.
This will test your new MongoDB code.
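As a rough illustration of such a script (not the only way to do it), here is a sketch using plain JDBC against Oracle and the MongoDB Java driver; the connection strings, table and fields are hypothetical, and the batch size of 1000 doubles as the progress-log interval mentioned above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class OracleToMongo {
    public static void main(String[] args) throws Exception {
        // Connection details, table and columns are placeholders.
        try (Connection oracle = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass");
             MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {

            MongoCollection<Document> customers =
                mongo.getDatabase("mydb").getCollection("customers");

            Statement stmt = oracle.createStatement();
            stmt.setFetchSize(1000);            // stream rows instead of pulling all 8M at once
            ResultSet rs = stmt.executeQuery("SELECT id, name, email FROM customers");

            List<Document> batch = new ArrayList<>();
            long count = 0;
            while (rs.next()) {
                // Reshape each row into the document structure MongoDB should hold.
                batch.add(new Document("_id", rs.getLong("id"))
                        .append("name", rs.getString("name"))
                        .append("email", rs.getString("email")));
                if (batch.size() == 1000) {
                    customers.insertMany(batch);
                    batch.clear();
                    count += 1000;
                    System.out.println("migrated " + count + " rows");  // progress / recovery marker
                }
            }
            if (!batch.isEmpty()) {
                customers.insertMany(batch);
            }
        }
    }
}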
The database is growing every second, so what's the plan to make sure that every record gets moved?
There are two strategies here.
Track the entries that are updated and re-run your script on newly updated records until you are caught up.
Write to both databases while you run the script to copy data. Then, once the script is done and everything is up to date, you can cut over to just using MongoDB.
I personally suggest #2; it is the easiest method to manage and test while maintaining up-time. It's still going to be a lot of work, but it will allow the transition to happen.

Transfer large amount of data from DB2 to Oracle?

I need to transfer large amounts of data (several million records) from DB2 to an Oracle database every day. Could you suggest the best-performing method to do that?
DB2 will allow you to select Oracle as a replication target. This is probably the most efficient and easiest way to do it every day, and it also removes the "intermediate container" objection that you have.
See this introduction (and more from the documentation online) for more.
If you're only talking about speed, then do this:
Time how long it takes to dump the DB2 data to a flatfile.
Time how long it takes to suck that flatfile into Oracle.
There's your baseline, and it's free. If you can beat that with an ETL tool, then decide whether the cost of the tool is worth it.
For a simple ETL like this, there's little that I've found that can beat this on time.
The downside of this is just general file manipulation BS...
how do you know when to read from the file
how do you know that you got all the rows
how do you resume when something breaks
All those little "niceties" usually get paid for with speed. Of course, I'm joking a bit. They aren't always a little nicety. They are often essential for a smooth running process.
Dump the data out to a delimited file. Load it into Oracle via a DIRECT-path sqlldr job. Not sexy, but fast. If you can be on the same subnet, that would be best (pushing data across the network is not what you want). Set this up on a cron job and add email alerts on errors.
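Just to illustrate, a hedged sketch of driving that DIRECT-path sqlldr load from Java instead of a shell script on cron; the control file contents, table, columns, paths and credentials are all placeholders.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectLoad {
    public static void main(String[] args) throws IOException, InterruptedException {
        // A minimal SQL*Loader control file; table and columns are placeholders.
        String ctl = String.join("\n",
                "LOAD DATA",
                "INFILE '/data/db2_dump.csv'",
                "APPEND INTO TABLE target_table",
                "FIELDS TERMINATED BY ','",
                "(col1, col2, col3)");
        Path ctlFile = Path.of("/data/load.ctl");
        Files.writeString(ctlFile, ctl);

        // direct=true enables the DIRECT-path load recommended above.
        Process p = new ProcessBuilder(
                "sqlldr", "userid=scott/tiger@ORCL",
                "control=" + ctlFile, "direct=true", "log=/data/load.log")
                .inheritIO()
                .start();

        int exit = p.waitFor();
        if (exit != 0) {
            // This is where you would hook in the email alert.
            System.err.println("sqlldr failed with exit code " + exit);
        }
    }
}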

How can I limit memory usage when generating a CSV from a large resultset?

I have a web application in Spring that has a functional requirement for generating a CSV/Excel spreadsheet from a result set coming from a large Oracle database. The expected rows are in the 300,000 - 1,000,000 range. Time to process is not as large of an issue as keeping the application stable -- and right now, very large result sets cause it to run out of memory and crash.
In a normal situation like this, I would use pagination and have the UI display a limited number of results at a time. However, in this case I need to be able to produce the entire set in a single file, no matter how big it might be, for offline use.
I have isolated the issue to the ParameterizedRowMapper being used to convert the result set into objects, which is where I'm stuck.
What techniques might I be able to use to get this operation under control? Is pagination still an option?
A simple answer:
Use a JDBC recordset (or something similar, with an appropriate array/fetch size) and write the data back to a LOB, either temporary or back into the database.
Another choice:
Use PL/SQL in the database to write a file using UTL_FILE for your recordset in CSV format. As the file will be on the database server, not on the client, use UTL_SMTP or JavaMail via Java Stored Procedures to mail the file. After all, I'd be surprised if someone was going to watch the hourglass turn over repeatedly waiting for a 1-million-row recordset to be generated.
Instead of loading the entire file into memory, you can process each row individually and use an output stream to send the output directly to the web browser. E.g. in the Servlet API, you can get the output stream from ServletResponse.getOutputStream() and then simply write the resulting CSV lines to that stream.
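Here is a minimal sketch of that streaming idea in Spring, assuming a JdbcTemplate and a made-up report query; it uses a RowCallbackHandler instead of a RowMapper so each row is written to the response as it is read (via getWriter(), the character-stream counterpart of getOutputStream()), plus a modest fetch size so the driver streams rows in chunks rather than buffering the whole result set.

import java.io.PrintWriter;

import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class CsvExportService {

    // The query and column names are placeholders for the real report SQL.
    private static final String REPORT_SQL = "SELECT id, name, amount FROM big_report_view";

    private final JdbcTemplate jdbcTemplate;

    public CsvExportService(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        // Ask the driver to stream rows in chunks instead of buffering everything.
        this.jdbcTemplate.setFetchSize(500);
    }

    public void writeCsv(HttpServletResponse response) throws Exception {
        response.setContentType("text/csv");
        response.setHeader("Content-Disposition", "attachment; filename=\"report.csv\"");
        PrintWriter out = response.getWriter();
        out.println("id,name,amount");

        // Each row goes straight to the response; nothing is accumulated in memory.
        RowCallbackHandler writeRow = rs -> {
            out.print(rs.getLong("id"));
            out.print(',');
            out.print(rs.getString("name"));
            out.print(',');
            out.println(rs.getBigDecimal("amount"));
        };
        jdbcTemplate.query(REPORT_SQL, writeRow);
        out.flush();
    }
}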
I would push back on those requirements; they sound pretty artificial.
What happens if your application fails, or the power goes out, before the user looks at that data?
From your comment above, it sounds like you know the answer: you need filesystem or Oracle access in order to do your job.
You are being asked to generate some data, something that is not repeatable by SQL?
If it were repeatable, you would just send pages of data back to the user at a time.
Since this report, I'm guessing, has something to do with the current state of your data, you need to store that result somewhere if you can't stream it out to the user. I'd write a stored procedure in Oracle; it's much faster not to send data back and forth across the network. If you have special tools, or if it's just easier, there's nothing wrong with doing it on the Java side instead.
Can you schedule this report to run once a week?
Have you considered the performance of an Excel spreadsheet with 1,000,000 rows?
