Exporting 8 million records from Oracle to MongoDB - oracle

I have an Oracle database with 8 million records that I need to move to MongoDB.
I know how to import data into MongoDB from a JSON file using the import command, but I want to know whether there is a better way to do this, given these issues:
Given the limit on execution time, how do I handle it?
The database grows every second, so what's the plan to make sure that every record has been moved?

Given the limit on execution time, how do I handle it?
Don't do it with the JSON export / import. Instead, write a script that reads the data, transforms it into the correct format for MongoDB, and then inserts it there.
There are a few reasons for this:
- Your tables / collections will not be organized the same way. (If they are, then why are you using MongoDB?)
- It lets you monitor the progress of the operation. In particular, you can write to a log file every 1000th entry or so, to track progress and to be able to recover from failures.
- It tests your new MongoDB code.
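A minimal sketch of such a script, in Python, assuming rows arrive as dictionaries with an "id" key. The names `migrate`, `transform`, `insert_batch` and `save_checkpoint` are hypothetical; in a real run `insert_batch` might wrap pymongo's `insert_many`, and `rows` would come from an Oracle cursor.

```python
import logging
from itertools import islice

log = logging.getLogger("migration")


def batches(rows, size):
    """Yield lists of at most `size` rows without materializing the whole input."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def migrate(rows, transform, insert_batch, save_checkpoint, batch_size=1000):
    """Copy rows in batches and return the number of rows written."""
    written = 0
    for batch in batches(rows, batch_size):
        docs = [transform(row) for row in batch]
        insert_batch(docs)
        written += len(docs)
        # Store the last migrated key so that a failed run can resume from it.
        last_id = batch[-1]["id"]
        save_checkpoint(last_id)
        log.info("migrated %d rows (last id %s)", written, last_id)
    return written
```

Because a checkpoint is saved after every batch, a restarted run only needs to read rows whose key is greater than the last saved value, provided the source is read in key order.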
The database grows every second, so what's the plan to make sure that every record has been moved?
There are two strategies here:
1) Track the entries that are updated and re-run your script on newly updated records until you are caught up.
2) Write to both databases while you run the script to copy data. Then, once the script has finished and everything is up to date, you can cut over to just using MongoDB.
I personally suggest #2; it is the easiest method to manage and test while maintaining uptime. It's still going to be a lot of work, but it will allow the transition to happen.
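For comparison, the first strategy (re-running the copy until caught up) could look roughly like this in Python. It assumes every source row carries a "modified" value that only increases, which the original schema may not have; `catch_up`, `fetch_changed_since` and `copy_rows` are hypothetical names.

```python
def catch_up(fetch_changed_since, copy_rows, start_marker=0, max_passes=100):
    """Re-run the copy on rows changed since the last pass until none remain.

    fetch_changed_since(marker) must return the rows whose "modified"
    value is greater than marker. Returns the final marker.
    """
    marker = start_marker
    for _ in range(max_passes):
        changed = fetch_changed_since(marker)
        if not changed:
            return marker  # nothing new since the last pass: caught up
        copy_rows(changed)
        marker = max(row["modified"] for row in changed)
    raise RuntimeError("source is changing faster than it can be copied")
```

The `max_passes` guard matters: if the source keeps changing faster than the script copies, the loop never converges, which is one reason writing to both databases can be easier to manage.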

Related

Spring Boot application with Postgres: indexes not being used during first use

I have a Spring Boot application that is using a Postgres database. When the application is deployed I need to run a transactional operation that uploads a zip file that is used to populate the database. The application is checking for duplicate rows before inserting them (because users can upload duplicate data that should just be ignored).
The problem I am having is that the first time I upload the file, even though the indexes have been created, they are not used when checking for the existence of a row. My theory is that this happens because the query planner decides not to use the index, since it relies on the original statistics, which show that the tables are empty. If I upload a small zip file first, the problem goes away because the tables now have data.
I have two questions. First, is my theory correct, or is there some other reason for this behaviour? Second, if it is correct, is there a way to force Postgres to update the query plan it uses at some predefined interval within the same transaction, and can this be done using JPA? Any ideas are appreciated.
Just in case someone runs into this issue, I'll post the solution I found. It appears my theory was correct. The queries will not use the indexes until some statistics are collected. One way to force this is to call ANALYZE after a number of rows have been written to the database. You can do this using a native query like this:
entityManager.createNativeQuery("ANALYZE " + tbl).executeUpdate();
You can wrap this call in a try catch and ignore any exceptions that might occur if you change the database engine. I couldn't find a way of doing this in a database-independent way but this approach works fine and now the initial upload performs as expected.
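The "after a number of rows" part can be sketched as follows. This is Python against an in-memory SQLite database, used here only as a runnable stand-in (SQLite also accepts `ANALYZE <table>`); it is not the JPA code from the answer, and the table name `items` and the function `load_rows` are made up for the illustration.

```python
import sqlite3


def load_rows(conn, rows, analyze_every=10000):
    """Insert (key, value) pairs, skipping duplicates, and refresh statistics periodically."""
    inserted = 0
    for key, value in rows:
        # Duplicate check: this is the lookup that benefits from the index.
        if conn.execute("SELECT 1 FROM items WHERE key = ?", (key,)).fetchone():
            continue
        conn.execute("INSERT INTO items (key, value) VALUES (?, ?)", (key, value))
        inserted += 1
        if inserted % analyze_every == 0:
            # Refresh planner statistics so later lookups can use the index.
            conn.execute("ANALYZE items")
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (key TEXT, value TEXT)")
conn.execute("CREATE INDEX idx_items_key ON items (key)")
count = load_rows(conn, [("a", "1"), ("b", "2"), ("a", "duplicate")], analyze_every=2)
```

The interval is a tuning choice: statistics gathered after the first few thousand rows are usually enough for the planner to stop treating the table as empty.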

Realistic Data Backup method for Parse.com

We are building an iOS app with Parse.com, but still can't figure out the right way to back up data efficiently.
As a premise, we have and will have a LOT of data store rows.
Say we have a class with 1 million rows and assume we have it backed up; we then want to bring it back to Parse after a disaster (like data loss on production).
The few solutions we have considered are the following:
1) Use external server for backup
BackUp:
- use the REST API to constantly back up data to a remote MySQL server (we chose MySQL for custom analytics, since it's much faster and easier for us to handle data with MySQL)
ImportBack:
a) - recreate JSON objects from the MySQL backup and use the REST API to send them back to Parse.
Say we use the batch operation, which allows 50 objects to be created with 1 query, and assume each query takes 1 sec: 1 million records (20,000 queries) will take about 5.5 hours to transfer to Parse.
b) - recreate one JSON file from the MySQL backup and use the Dashboard to import the data manually.
We just tried this method with a file of 700,000 records: it took about 2 hours for the loading indicator to stop and show the number of rows in the left pane, but the data never opens in the right pane (it says "operation time out"), and it has been over 6 hours since the upload started.
So we can't rely on 1.b, and 1.a seems to take too long to recover from a disaster (with 10 million records, it would take about 55 hours, roughly 2.3 days).
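The estimate for 1.a is plain arithmetic and can be checked with a few lines of Python. The batch size of 50 and the 1 second per request are the assumptions stated above, not measured values, and `restore_hours` is a name made up for this check.

```python
import math


def restore_hours(records, batch_size=50, seconds_per_request=1.0):
    """Estimate the hours needed to re-import `records` objects via batch requests."""
    requests = math.ceil(records / batch_size)
    return requests * seconds_per_request / 3600


one_million = restore_hours(1_000_000)          # 20,000 requests
ten_million_days = restore_hours(10_000_000) / 24
```

The time grows linearly with the number of records, so only a larger batch size or parallel requests would shorten the restore.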
Now we are thinking about the following:
2) Constantly replicate data to another app
Create the following in Parse:
- Production App: A
- Replication App: B
So while A is in production, every single query will be duplicated to B (using a constantly running background job).
The downside, of course, is that it will eat into A's burst limit, since it simply doubles the number of queries. So it's not ideal when thinking about scaling up.
What we want is something like AWS RDS, which gives an option to back up automatically every day.
I wonder why this would be difficult for Parse, since it's based on AWS infrastructure.
Please let me know if you have any ideas on this; we will be happy to share know-how.
P.S.:
We've noticed an important flaw in idea 2) above.
If we replicate using the REST API, the objectIds of all classes will change, so every one-to-one or one-to-many relation will be broken.
So we are thinking about adding a uuid to every object class.
Is there any problem with this method?
One thing we want to achieve is
query.include("ObjectName")
(or in Obj-C "includeKey"),
but I suppose that won't be possible if we don't base our app logic on objectId.
We are looking for a workaround for this issue,
but will uuid-based management be functional under Parse's Datastore logic?
Parse has never lost production data. While we don't currently offer automated backups, you can request one any time you like, and we're working on making all of this even nicer. Additionally, it's easier in most cases to import the JSON export file through the data browser rather than using the REST batch.
I can confirm that today Parse did lose my data. Or at least it appeared so.
After several errors were detected on multiple apps (acknowledged by the Parse Status Twitter account), we could not retrieve data for one app, yet no error was returned.
This was because an entire column (of type pointer) of one of our classes had disappeared, and its data was no longer present in the dashboard.
We use this pointer column to filter / retrieve data, so the queries returned empty collections.
So we decided to recreate the column manually. By chance, recreating the column with the same name and type solved the issue, and the data was still there... I can't explain it, but I really thought the data was lost, and the app reacted as if it were.
So an automated backup and restore feature is mandatory, not optional.
In December 2015, parse.com released a new dashboard with an improved export feature.
Just select your app and click "App Settings" -> "General" -> "Export app data". Parse generates a JSON file for every class in your app and sends you an email when the export is done.
UPDATE:
Sad but true, parse.com is winding down: http://blog.parse.com/announcements/moving-on/
I had the same issue with backing up Parse Server data. Since Parse Server uses MongoDB, backing up the data is not a problem. I did a simple thing: I downloaded the MongoDB backup from the server and then restored it using
mongorestore /path-to-mongodump (extracted files)
Since Parse has been open-sourced, we can adopt this technique.
For accidental deletes, writing a 'beforeDelete' cloud function that backs up the current row to another class would work.
For regular backups, manually exporting changed records (use a filter) will be useful. For recovery, this requires you to write scripts or use the import option in the data browser (I'm not so sure about the latter). You could also write a cloud function to replicate data to your backup server (I haven't tried this yet).
However there are some limitations to cloud code that you should consider before venturing into it:
https://parse.com/docs/cloud_code_guide#functions-resource

Pentaho Spoon Microsoft Access output performance

I'm trying to populate a database table (Access) with the "Microsoft Access Output" step, but I get very bad performance. My data was read from 2 XML files and merged in 1 minute (36,000 rows of data), and the Access output has now been running for 1 hour at 12 r/s. I set the commit size high enough to commit all my data at once (with a commit size of 500, I got a Java error after the 10th commit).
If I write my data to a CSV file and import it into Access, that is much quicker, but I want to automate as much of the process as possible.
Any suggestions on how to get better performance out of this are welcome.
I've never had good luck with the MS-Access output step. It seems to work much better if you create an ODBC entry for your Access DB and load it with a regular Table Output step.
Also you might check that you have the most current version of the Jackcess driver.
Edit: also, for automating flows of data, like writing a CSV and then loading it, you would use a Job. If you're only familiar with transforms, look at the docs on ETL jobs.

How to implement ORACLE to VERTICA replication?

I am in the process of building an Oracle-to-Vertica replication process.
We are looking to create a Vertica DB that will run heavy reports. So far it's all good: Vertica is fast, its space usage is great, and everything is fine until we get to the main part, getting the data from Oracle to Vertica.
The initial load is OK: dump to CSV from Oracle, load into Vertica. Load times are so short that everybody thinks it's a bad joke or that there's some magic going on. It is simply fast.
Now the bad part: both databases (Oracle and Vertica) are up and running, and data is being changed in Oracle, so I need to replicate those changes to Vertica. What now?
From my tests, and from what I understand about Vertica, inserts and updates should not be used beyond maybe 20 per second, so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and running ETL on it to create CSV files with the new, altered, and deleted data, and then applying them to Vertica. But I cannot apply a change list like this,
because explicit data changes in Vertica lead to slow performance.
So I am looking for ideas on how to solve this, given that I cannot:
- Alter my Oracle production structure.
- Use Oracle environment resources for filtering the data.
- Use insert, update or delete statements in my Vertica load process.
Things I depend on:
- The use of the COPY command
- Data consistency
- A window of at most 60 minutes (every 60 minutes, new/altered data needs to go to Vertica).
I have seen the Continuent data replication product, but it seems that nobody wants to sell it; I cannot get in touch with them.
Would loading the whole data set into a new table and then swapping the tables be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old,new,swap rename to swap,old,new;
truncate table new;
Extract the data from Oracle (in .csv format) and load it using the Vertica COPY command. Write a simple shell script to automate this process.
I used to use Talend (ETL), but it was very slow, so I moved to this conventional process and it has worked really well for me. I am currently processing 18M records, and the entire process takes less than 2 minutes.
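The two answers above can be combined into one scripted cycle: write the extract to CSV, bulk-load it into a staging table, then swap. Below is a rough Python sketch (instead of shell) that only builds the CSV and the SQL text. The `_new` and `_swap` table suffixes, the file path and the COPY options `FROM LOCAL`, `DELIMITER` and `ENCLOSED BY` are assumptions to check against your Vertica version, and the statements would still have to be run through a client such as vsql.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write rows to a file-like object as CSV and return the row count."""
    writer = csv.writer(out)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


def load_statements(table, csv_path):
    """Build the load-then-swap statements for one refresh cycle."""
    staging = table + "_new"  # hypothetical staging table, same columns as `table`
    swap = table + "_swap"    # temporary name used only during the rename
    return [
        "COPY %s FROM LOCAL '%s' DELIMITER ',' ENCLOSED BY '\"';" % (staging, csv_path),
        "ALTER TABLE %s, %s, %s RENAME TO %s, %s, %s;"
        % (table, staging, swap, swap, table, staging),
        "TRUNCATE TABLE %s;" % staging,
    ]


buffer = io.StringIO()
written = rows_to_csv([(1, "a"), (2, "b,c")], buffer)
statements = load_statements("sales", "/tmp/sales.csv")
```

Since the load uses only COPY and a rename, it avoids insert, update and delete statements, which matches the constraints in the question, at the cost of reloading the full table every cycle.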

How can I limit memory usage when generating a CSV from a large resultset?

I have a web application in Spring that has a functional requirement for generating a CSV/Excel spreadsheet from a result set coming from a large Oracle database. The expected rows are in the 300,000 - 1,000,000 range. Time to process is not as large of an issue as keeping the application stable -- and right now, very large result sets cause it to run out of memory and crash.
In a normal situation like this, I would use pagination and have the UI display a limited number of results at a time. However, in this case I need to be able to produce the entire set in a single file, no matter how big it might be, for offline use.
I have isolated the issue to the ParameterizedRowMapper being used to convert the result set into objects, which is where I'm stuck.
What techniques might I be able to use to get this operation under control? Is pagination still an option?
A simple answer:
Use a JDBC recordset (or something similar, with an appropriate array/fetch size) and write the data out to a LOB, either a temporary one or one stored back in the database.
Another choice:
Use PL/SQL in the database to write a file using UTL_FILE for your recordset in CSV format. As the file will be on the database server, not on the client, use UTL_SMTP, or JavaMail via Java stored procedures, to mail the file. After all, I'd be surprised if someone was going to watch the hourglass turn over repeatedly while waiting for a 1 million row recordset to be generated.
Instead of loading the entire result set into memory, you can process each row individually and use an output stream to send the output directly to the web browser. E.g. in the Servlet API, you can get the output stream from ServletResponse.getOutputStream() and then simply write the resulting CSV lines to that stream.
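The same row-by-row idea, sketched in Python rather than with the Servlet API: read the cursor in fixed-size chunks and write each chunk straight to the output stream, so memory use depends on the fetch size and not on the size of the result. `stream_csv` is a hypothetical helper; `cursor` can be any DB-API cursor and `out` any writable text stream, such as an HTTP response body.

```python
import csv


def stream_csv(cursor, out, fetch_size=1000):
    """Write a query result to `out` as CSV, holding at most `fetch_size` rows in memory."""
    writer = csv.writer(out)
    # Header row taken from the cursor's column metadata.
    writer.writerow([column[0] for column in cursor.description])
    total = 0
    while True:
        rows = cursor.fetchmany(fetch_size)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total
```

In the Java setting from the question, the equivalent would be a row callback that writes each row as it arrives instead of a row mapper that collects objects into a list.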
I would push back on those requirements; they sound pretty artificial.
What happens if your application fails, or the power goes out before the user looks at that data?
From your comment above, it sounds like you know the answer: you need filesystem or Oracle access in order to do your job.
You are being asked to generate some data. Is it something that is not repeatable by SQL?
If it were repeatable, you would just send pages of data back to the user at a time.
Since this report, I'm guessing, has something to do with the current state of your data, you need to store that result somewhere if you can't stream it out to the user. I'd write a stored procedure in Oracle; it's much faster not to send data back and forth across the network. If you have special tools, or it's just easier, it sounds like there's nothing wrong with doing it on the Java side instead.
Can you schedule this report to run once a week?
Have you considered the performance of an Excel spreadsheet with 1,000,000 rows?
