Google Cloud SQL: Alternative to LOAD DATA INFILE

I am working in Google App Engine, and we have a Python script that dumps data into Google Cloud SQL. One of the data sets we have to dump is huge: we dump around 150K rows once a day.
I know Google Cloud SQL does not support LOAD DATA INFILE, which I would normally have used. My question is whether there is an alternative to LOAD DATA INFILE that I can use to speed up the data dump.
Inserting the data normally, without LOAD DATA INFILE, takes about 5 minutes.

As stated in a comment on another question, LOAD DATA LOCAL INFILE is supported by App Engine.
The MySQL manual explains how to use this statement.
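For example, here is a minimal sketch of running LOAD DATA LOCAL INFILE from a Python script, assuming the MySQLdb driver; the socket path, credentials, table, and file name are placeholders:

import MySQLdb

conn = MySQLdb.connect(
    unix_socket='/cloudsql/PROJECT:INSTANCE',  # placeholder App Engine-style socket path
    user='dbuser',
    passwd='secret',
    db='mydb',
    local_infile=1)  # the client must allow LOCAL INFILE
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/daily_dump.csv'
    INTO TABLE daily_rows
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
conn.commit()
conn.close()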

Things you can do to get better bulk import performance:
Create a .sql file and do an import
Make sure that the INSERT statements insert more than one row at a time; a good rule of thumb is about one megabyte per INSERT (see the sketch after this list).
Switch to async replication
Do the import from an App Engine app. The app will be colocated with your Cloud SQL instance, greatly reducing the network latency.
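If LOAD DATA LOCAL INFILE is not practical, batching rows into multi-row INSERT statements (the one-megabyte rule of thumb above) already helps a lot. A rough sketch, again assuming the MySQLdb driver; the table, columns, and sample rows are made up:

import MySQLdb

conn = MySQLdb.connect(unix_socket='/cloudsql/PROJECT:INSTANCE',
                       user='dbuser', passwd='secret', db='mydb')
cur = conn.cursor()

rows = [(1, 'a', 2.5), (2, 'b', 3.5)]   # stand-in for the ~150K rows the script produces
BATCH = 5000                            # tune so each statement stays around 1 MB
for i in range(0, len(rows), BATCH):
    # MySQLdb's executemany() rewrites this into one multi-row INSERT per batch
    cur.executemany(
        "INSERT INTO daily_rows (col1, col2, col3) VALUES (%s, %s, %s)",
        rows[i:i + BATCH])
    conn.commit()
conn.close()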

Related

How do I transfer a huge amount of data (nearly 10TB) from Oracle DB to Snowflake in hours?

How do I transfer a huge amount of data (nearly 10TB) from Oracle DB to Snowflake in hours? I see some options like Hevo and Fivetran, which are paid. However, I need the data to be moved fast so that I do not need to keep the production system down.
The fastest way to get data into Snowflake is in 10MB to 100MB chunk files. Then, you can leverage a big warehouse to COPY INTO all of the chunk files at one time. I can't speak to how to get the data out of Oracle DB quickly to S3/Azure Blob, though, especially while the system is running its normal workload.
I recommend you look at this document from Snowflake for reference on the Snowflake side: https://docs.snowflake.net/manuals/user-guide/data-load-considerations-prepare.htm
Is there a network speed issue? In any case, the data should be compressed when it is transferred over the network.
There are three locations involved in the staging:
Oracle database,
the extraction client,
and the cloud storage.
You have two data transfers:
between database and client,
and between client and cloud storage.
If the Oracle version is 12cR2 or newer, the database client can compress data as it pulls it out of the database. The extracted files should then be compressed as well and transferred to the cloud storage used by your Snowflake destination.
The final step is to load the data from cloud storage into Snowflake (within the same data center)...
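To follow the 10MB-100MB guidance above, the extracted CSV can be split and gzip-compressed before it is staged; here is a rough sketch in Python, where the file names, chunk size, and COPY INTO target are only illustrative:

import gzip

CHUNK_BYTES = 50 * 1024 * 1024      # ~50 MB of uncompressed CSV per chunk

def split_csv(path, prefix):
    # Split a large CSV into gzip-compressed chunks, repeating the header in each one.
    with open(path, 'rb') as src:
        header = src.readline()
        part, out, written = 0, None, 0
        for line in src:
            if out is None or written >= CHUNK_BYTES:
                if out:
                    out.close()
                part += 1
                written = 0
                out = gzip.open('%s_%04d.csv.gz' % (prefix, part), 'wb')
                out.write(header)
            out.write(line)
            written += len(line)
        if out:
            out.close()

split_csv('oracle_extract.csv', 'chunks/extract')
# The chunks can then be PUT to a Snowflake stage and loaded in parallel, e.g.:
#   COPY INTO my_table FROM @my_stage FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)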
Ideally you shouldn't need to keep the production database down. You should be able to categorise the data into:
1 - historical data that will not change. You can extract this data at your leisure, and it should not require the database to be down.
2 - static data that is fairly stable. You can also extract this data at your leisure.
You only need to keep your database fairly stable (not down) while you are extracting the rest of the data. This will require you to build some way to track and validate all your datasets. There is no reason why you couldn't let users continue to read from the database while you are performing the extract from Oracle.

Is Redis a good choice for performing large-scale calculations?

We are currently pulling a large amount of data from the Oracle database and then performing calculations on the web side to generate HTML reports. I am using the Groovy & Grails framework for report generation.
The problem is that the calculations are very heavy, and it takes a lot of time to generate the reports on the web side.
I am planning to re-architect my reporting so that reports are generated very quickly.
I don't have any control over the Oracle database, as it is a third-party production database.
I don't want any replication of the database, because it has millions of records; I can't schedule replication, and it would slow down production.
I finally came up with a caching architecture that acts as a calculation engine.
Can anyone help me by suggesting the best solution?
Thanks
What is the structure of your data? Do you need to query it (so SQL could help you), or is it binary/document data?
Do you need persistence (durability) or not?
Redis is fast. But a single-threaded app using MS SQL and its bulk importer is incredibly fast too.
Redis is a key/value store, so you need to perform a single SET for every column of your domain object, which can make it slower than an RDBMS that INSERTs all columns in one statement.
Or, if your results are in the form of JSON objects, Mongo can be very useful.
It just depends on your data and the purpose of persistence.
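For what it's worth, the per-column overhead can be reduced by storing each domain object as a single Redis hash rather than one key per column; here is a small sketch assuming the redis-py client, with made-up key and field names and a small literal list standing in for the extracted data:

import redis

r = redis.Redis(host='localhost', port=6379)

rows = [
    {'region': 'EMEA', 'revenue': 125000, 'orders': 310},
    {'region': 'APAC', 'revenue': 98000, 'orders': 245},
]  # stand-in for the data pulled from Oracle

# One HSET per object (a hash of all columns) instead of one SET per column;
# a pipeline batches the writes into a single round trip.
pipe = r.pipeline()
for i, row in enumerate(rows):
    pipe.hset('report:row:%d' % i, mapping=row)
pipe.execute()

# Reading a row back for a calculation:
print(r.hgetall('report:row:0'))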

Pentaho Spoon Microsoft Access output performance

I'm trying to populate a database table (Access) with the "Microsoft Access Output" step, but I get very bad performance. My data was read from 2 XML files and merged in 1 minute (36,000 rows), while the Access output has now been running for an hour at 12 r/s. I set the commit size high enough to commit all my data at once (with a commit size of 500 I got a Java error after the 10th commit).
If I write my data to a CSV and import it into Access, that is much quicker, but I want to automate as much of the process as possible.
Any suggestions on how to get better performance out of this are welcome.
I've never had good luck with the MS-Access output step. It seems to work much better if you create an ODBC entry for your Access DB and load it with a regular Table Output step.
Also you might check that you have the most current version of the Jackcess driver.
Edit: also, for automating flows of data, like writing a CSV and then loading it, you would use a Job. If you're only familiar with transforms, look at the docs on ETL jobs.
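Outside of Spoon, the same ODBC route can be exercised from a short script, which is a handy way to confirm that the DSN and driver work before wiring up the Table Output step; here is a sketch assuming the pyodbc package, a DSN named MyAccessDb, and a made-up table:

import pyodbc

# Either a DSN you created in the ODBC administrator...
conn = pyodbc.connect('DSN=MyAccessDb')
# ...or a DSN-less connection string (Windows Access ODBC driver):
# conn = pyodbc.connect(r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\target.accdb')

cur = conn.cursor()
rows = [(1, 'alpha'), (2, 'beta')]        # stand-in for the merged XML data
cur.executemany('INSERT INTO merged_data (id, label) VALUES (?, ?)', rows)
conn.commit()
conn.close()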

How to implement ORACLE to VERTICA replication?

I am in the process of creating an Oracle to Vertica process!
We are looking to create a Vertica DB that will run heavy reports. For now all is cool: Vertica is fast, space use is great, and everything is fine until we get to the main part, getting the data from Oracle into Vertica.
OK, the initial load is fine: dump from Oracle to CSV, load into Vertica, and load times are so good that everybody thinks it's a bad joke or that there's some magic going on. It is simply fast.
Now the bad part: both databases (Oracle and Vertica) are up and running, and data keeps getting altered in Oracle, so I need to replicate that data into Vertica. What now?
From my tests and from what I understand about Vertica, single-row inserts and updates are not to be used beyond maybe a maximum of 20 per second, so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and ETL-ing it to create CSV data with the new, altered, and deleted rows, and then applying it to Vertica, but I cannot apply a change list like this, because explicit data changes in Vertica lead to slow performance.
So I am looking for some ideas about how I can solve this issue, knowing I cannot:
Alter my Oracle production structure.
Use Oracle environment resources for filtering the data.
Use INSERT, UPDATE, or DELETE statements in my Vertica load process.
Things I depend on:
The use of the COPY command.
Data consistency.
A maximum 60-minute window (every 60 minutes, new/altered data needs to go to Vertica).
I have looked at Continuent data replication, but it seems that nobody wants to sell their product; I cannot get in touch with them.
Would loading the whole data set into a new table and then swapping the tables be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old,new,swap rename to swap,old,new;
truncate table new;
Extract data from Oracle (in CSV format) and load it with the Vertica COPY command. Write a simple script to automate this process; a rough sketch follows below.
I used to use Talend (ETL), but it was very slow, so I moved to this conventional process and it has really worked for me. Currently processing 18M records, my entire process takes less than 2 minutes.
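Here is a rough sketch of that kind of automation in Python; the sqlplus/vsql invocations, credentials, table, and file names are assumptions rather than a tested pipeline:

import subprocess

EXTRACT_SQL = """SET MARKUP CSV ON
SPOOL /tmp/orders.csv
SELECT * FROM orders WHERE updated_at >= SYSDATE - 1/24;
SPOOL OFF
EXIT
"""

# 1. Extract from Oracle to CSV (SET MARKUP CSV ON needs SQL*Plus 12.2+).
subprocess.run(['sqlplus', '-s', 'user/pass@ORCL'],
               input=EXTRACT_SQL, text=True, check=True)

# 2. Bulk load into Vertica with COPY.
subprocess.run(['vsql', '-h', 'vertica-host', '-U', 'dbadmin', '-c',
                "COPY staging.orders FROM LOCAL '/tmp/orders.csv' "
                "DELIMITER ',' ENCLOSED BY '\"' SKIP 1 ABORT ON ERROR;"],
               check=True)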

What can I do to enhance the performance of bulk data loading using Derby?

I am using the Derby in-memory DB. I need to perform some data loading from CSV files at startup. For now it takes about 25 seconds to load all the CSV files into their tables. I hope the time can be reduced, since the data files are actually not very large.
What I have done is use the built-in import procedure from Derby:
{CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (?,?,?,',','"','UTF-8',1 )} or
{CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE (?,?,?,',','"','UTF-8',0 )}
The only special thing is that sometimes the data for one table is split into many small CSV files, so I have to load them one by one. I have tested that if I combine them, the load takes only 16 seconds. However, I cannot remove this feature because it is needed by the user.
Is there anything I can do to reduce the data-loading time? Should I disable logging, write some user-defined function/procedure, or is there any other tuning that can be done? Any advice is welcome.
Thanks!
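Since combining the files is already known to cut the load from 25 to 16 seconds, one option is to concatenate each table's CSV fragments into a single temporary file right before import, so the user-visible split files stay untouched; here is a sketch in Python, with an assumed directory layout and header-less fragments:

import glob
import shutil
import tempfile

def merge_fragments(table_dir):
    # Assumes the fragments carry no header row (the 0 variant of SYSCS_IMPORT_TABLE).
    merged = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
    with merged:
        for part in sorted(glob.glob(table_dir + '/*.csv')):
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, merged)
    return merged.name  # pass this single path to SYSCS_UTIL.SYSCS_IMPORT_TABLE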
Use H2 instead of Derby, and use the CSVREAD feature. If that's still too slow, see the fast import optimization, or use the CSV tool directly (without using a database). Disclaimer: I wrote the CSV support for H2.
