JdbcTemplate batch update with Snowflake database is very slow - performance

I have a Spring Boot application that connects to a Snowflake database and uploads records (approximately 50 columns of different data types). I am using
JdbcTemplate.batchUpdate(insertSql, values, types)
to do the bulk insert. Currently, it takes around 100 seconds for 50,000 records. I want to improve the batch performance but have not been able to find an optimal solution.
I referred to and tried the solution mentioned in this post, but it didn't help at all. Any suggestions will be highly appreciated.
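For reference, here is a minimal sketch of the call pattern described above; the table name, columns, and SQL types are hypothetical placeholders, not the actual schema, and the real statement would list all ~50 columns with matching types.
import java.sql.Types;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class SnowflakeBatchInsert {
    // Hypothetical two-column version of the batchUpdate call in the question.
    static int[] insertBatch(JdbcTemplate jdbcTemplate, List<Object[]> rows) {
        String insertSql = "INSERT INTO my_table (id, name) VALUES (?, ?)";
        int[] types = { Types.NUMERIC, Types.VARCHAR };
        // One network round trip per batch; each Object[] in rows supplies the bind values.
        return jdbcTemplate.batchUpdate(insertSql, rows, types);
    }
}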

I moved away from batch inserts to the Snowflake COPY command using JDBC. It is lightning fast. With the COPY command, it barely takes 2-3 seconds to load 50,000 records from a CSV file with an XS (extra small) warehouse.
Moreover, in case of errors, the messages are very clear and can be viewed in information_schema.load_history. Different file formats can be loaded, and there are a variety of options to customize the load process.
In my case, I first load the CSV file to the internal staging area (takes less than 1 second), run the COPY command (takes 1-2 seconds), verify the load status in the information_schema.load_history table (takes a few milliseconds), and proceed accordingly.
This article was also helpful for running the COPY command with JDBC.
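For illustration, here is a rough sketch of that workflow over plain JDBC; the connection URL, table name, file path, and file format options are assumptions and will differ in a real setup.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SnowflakeCopyLoad {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL, credentials, table and file names.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:snowflake://<account>.snowflakecomputing.com/?db=MY_DB&schema=PUBLIC",
                 "user", "password");
             Statement stmt = con.createStatement()) {

            // 1. Upload the CSV file to the table's internal stage (takes about a second).
            stmt.execute("PUT file:///tmp/records.csv @%my_table AUTO_COMPRESS=TRUE");

            // 2. Run the COPY command to load the staged (gzipped) file (takes 1-2 seconds).
            stmt.execute("COPY INTO my_table FROM @%my_table/records.csv.gz "
                       + "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");

            // 3. Verify the load status in information_schema.load_history.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT file_name, status, row_count, first_error_message "
                  + "FROM information_schema.load_history "
                  + "WHERE table_name = 'MY_TABLE' ORDER BY last_load_time DESC LIMIT 1")) {
                while (rs.next()) {
                    System.out.println(rs.getString("file_name") + " -> "
                        + rs.getString("status") + " (" + rs.getLong("row_count") + " rows)");
                }
            }
        }
    }
}
The PUT command uploads (and by default gzip-compresses) the local file into the table's internal stage, and the load_history query confirms whether the COPY succeeded and how many rows it loaded.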

Related

How to make a Spotfire link open faster?

I've published a Spotfire file with 70 '.txt' files linked to it. The total size of the files is around 2 GB. When the users open it in their web browser, it takes around 27 minutes to load the linked tables.
I need an option that improves opening performance. The issue seems to be the amount of data and the way the files are linked to Spotfire.
This runs on a server and the users open the BI in their browser.
I've tried embedding the data; it lowers the time, but forces me to interact with the software every time I want to update the data. The solution is supposed to run automatically.
I need to open this in less than 5 minutes.
Update:
- I need the data to be updated at least twice a day.
- The embedded option is acceptable from the time perspective, but the system needs to run without my intervention.
- I've never used Spotfire automation services.
Schedule the report to cache twice a day on the Spotfire server by setting up a rule under scheduling and routing. The good thing about this is while it is updating the analysis for the second time during the day, it will still allow users to quickly open older data until it is complete. To the end user it will open in seconds but behind the scenes you have just pre-opened the report. Once you set up the rule this will run automatically with no intervention needed.
All functionality and scripting within the report will work the same, and it can be opened many times simultaneously by different users. This is really the best way if you have to link to that many files. Otherwise, try collapsing files, aggregating data, and removing all unnecessary columns and data tables so the data pulls through faster.

Hadoop vs Cassandra: Which is better for the following scenario?

There is a situation in our systems in which the user can view and "close" a report. After they close it, the report is moved to a temporary table inside the database where it is kept for 24 hours, and then moved to an archives table (where the report is stored for the next 7 years). At any point during the 7 years, a user can "reopen" the report and work on it. The problem is that the archives storage is getting large and finding/reopening reports tends to be time consuming. I also need to get statistics on the archives from time to time (e.g., report dates, clients, average length "opened", etc.). I want to use a big data approach but I am not sure whether to use Hadoop, Cassandra, or something else. Can someone provide me with some guidelines on how to get started and decide what to use?
If your archive is large and you'd like to get reports from it, you won't be able to use just Cassandra, as it has no easy means of aggregating the data. You'll end up collocating Hadoop and Cassandra on the same nodes.
From my experience, archives (write once, read many) are not the best use case for Cassandra if you have a lot of writes (we've tried it as a backend for a backup system). Depending on your compaction strategy, you'll pay either in space or in IOPS for it. Added changes are propagated through the SSTable hierarchies, resulting in many more writes than the original change.
It is not possible to answer your question in full without knowing other variables: How much hardware (servers, their RAM/CPU/HDD/SSD) are you going to allocate? What is the size of each 'report' entry? How many reads/writes do you usually serve daily? How large is your archive storage now?
Cassandra might work fine. Keep two tables, reports and reports_archive. Define the schema using a TTL of 24 hours and 7 years:
CREATE TABLE reports (
...
) WITH default_time_to_live = 86400;
CREATE TABLE reports_archive (
...
) WITH default_time_to_live = 86400 * 365 * 7;
Use the new Time Window Compaction Strategy (TWCS) to minimize write amplification. It could be advantageous to store the report metadata and report binary data in separate tables.
For roll-up analytics, use Spark with Cassandra. You don't mention the size of your data, but roughly speaking 1-3 TB per Cassandra node should work fine. Using RF=3 you'll need at least three nodes.

Pentaho Spoon Microsoft Access output performance

I'm trying to populate a database table (Access) with the "Microsoft Access Output" step, but I get very bad performance. My data was read from 2 XML files and merged in 1 minute (36,000 rows of data), while the Access output has now been running for an hour at 12 rows/s. I set the commit size high enough to commit all my data at once (with a commit size of 500 I got a Java error after the 10th commit).
If I write my data to a CSV file and import it into Access, that is much quicker, but I want to automate as much of the process as possible.
Any suggestions on how to get better performance out of this are welcome.
I've never had good luck with the MS-Access output step. It seems to work much better if you create an ODBC entry for your Access DB and load it with a regular Table Output step.
Also you might check that you have the most current version of the Jackcess driver.
Edit: also, for automating flows of data, like writing a CSV and then loading it, you would use a Job. If you're only familiar with transforms, look at the docs on ETL jobs.

AWS Redshift JDBC insert performance

I am writing a proof-of-concept app which is intended to take live clickstream data at the rate of around 1000 messages per second and write it to Amazon Redshift.
I am struggling to get anything like the performance some others claim (for example, here).
I am running a cluster with 2 x dw.hs1.xlarge nodes (+ leader), and the machine that is doing the load is an EC2 m1.xlarge instance on the same VPC as the Redshift cluster running 64 bit Ubuntu 12.04.1.
I am using Java 1.7 (openjdk-7-jdk from the Ubuntu repos) and the Postgresql 9.2-1002 driver (principally because it's the only one in Maven Central which makes my build easier!).
I've tried all the techniques shown here, except the last one.
I cannot use COPY FROM because we want to load data in "real time", so staging it via S3 or DynamoDB isn't really an option, and Redshift doesn't support COPY FROM stdin for some reason.
Here is an excerpt from my logs showing that individual rows are being inserted at the rate of around 15/second:
2013-05-10 15:05:06,937 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 170
2013-05-10 15:05:18,707 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:05:18,708 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 712
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 167
2013-05-10 15:06:14,381 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
What am I doing wrong? What other approaches could I take?
Redshift (aka ParAccel) is an analytic database. The goal is to enable analytic queries to be answered quickly over very large volumes of data. To that end, Redshift stores data in a columnar format. Each column is held separately and compressed against the previous values in the column. This compression tends to be very effective because a given column usually holds much repetitive and similar data.
This storage approach provides many benefits at query time because only the requested columns need to be read and the data to be read is highly compressed. However, the cost of this is that inserts tend to be slower and require much more effort. Also, inserts that are not perfectly ordered may result in poor query performance until the tables are VACUUM'ed.
So, by inserting a single row at a time you are completely working against the way that Redshift works. The database has to append your data to each column in succession and calculate the compression. It's a little bit (but not exactly) like adding a single value to a large number of zip archives. Additionally, even after your data is inserted you still won't get optimal performance until you run VACUUM to reorganise the tables.
If you want to analyse your data in "real time" then, for all practical purposes, you should probably choose another database and/or approach. Off the top of my head, here are three:
Accept a "small" batching window (5-15 minutes) and plan to run VACUUM at least daily.
Choose an analytic database (more $) which copes with small inserts, e.g., Vertica.
Experiment with "NoSQL" DBs that allow single path analysis, e.g., Acunu Cassandra.
The reason single inserts are slow is the way Redshift handles commits. Redshift has a single commit queue.
Say you insert row 1 and then commit; the commit goes to the Redshift commit queue to be completed.
Then you insert row 2 and commit; this again goes to the commit queue. If the commit of row 1 has not finished by then, row 2's commit waits for it to complete before it can start.
So if you batch your inserts, Redshift performs a single commit, which is faster than issuing individual commits to the Redshift system.
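As a rough illustration of that point, the sketch below sends many inserts as a single JDBC batch inside one transaction, so only one commit goes through the commit queue; the table and column names are hypothetical.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class RedshiftBatchedCommit {
    // Sketch: queue many inserts as one JDBC batch inside one transaction,
    // so only a single commit enters Redshift's commit queue.
    static void writeBatch(Connection con, List<Object[]> rows) throws Exception {
        con.setAutoCommit(false);
        String sql = "INSERT INTO clicks (user_id, url, ts) VALUES (?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.setObject(3, row[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        con.commit(); // a single commit for the whole batch
    }
}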
You can get commit queue information via Tip #9: Maintaining efficient data loads in the link below.
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
We have been able to insert 1,000 rows/sec into Redshift by batching several requests together in the same INSERT statement (in our case we had to batch ~200 value tuples in each INSERT). If you use an ORM layer like Hibernate, you can configure it for batching (e.g., see http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html).
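A minimal sketch of that multi-row INSERT approach over plain JDBC might look like the following; the table and column names are made up, and each chunk passed in would hold roughly 200 rows.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class RedshiftMultiRowInsert {
    // Sketch: pack many value tuples into a single INSERT ... VALUES (...), (...), ... statement.
    static void insertChunk(Connection con, List<Object[]> chunk) throws Exception {
        StringBuilder sql = new StringBuilder("INSERT INTO clicks (user_id, url, ts) VALUES ");
        for (int i = 0; i < chunk.size(); i++) {
            sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
        }
        try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
            int p = 1;
            for (Object[] row : chunk) {
                ps.setObject(p++, row[0]);
                ps.setObject(p++, row[1]);
                ps.setObject(p++, row[2]);
            }
            ps.executeUpdate(); // one statement, one set of ~200 tuples
        }
    }
}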
I've been able to achieve 2,400 inserts/second by batching writes into transactions of 75,000 records per transaction. Each record is small, as you might expect, being only about 300 bytes per record.
I'm querying a MariaDB installed on an EC2 instance and inserting the records into RedShift from the same EC2 instance that Maria is installed on.
UPDATE
I modified the way I was doing writes so that it loads the data from MariaDB in 5 parallel threads and writes to RedShift from each thread. That increased performance to 12,000+ writes/second.
So yeah, if you plan it correctly you can get great performance from RedShift writes.
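A rough sketch of that parallel layout, assuming an ExecutorService with 5 workers, a placeholder JDBC URL, and the writeBatch() helper from the earlier sketch (all of these are assumptions, not the author's actual code):
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelRedshiftWriter {
    // Each worker thread opens its own connection and writes its chunk using
    // large batched transactions (writeBatch from the earlier sketch).
    static void loadInParallel(List<List<Object[]>> chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        for (List<Object[]> chunk : chunks) {
            pool.submit(() -> {
                try (Connection con = DriverManager.getConnection(
                        "jdbc:postgresql://redshift-host:5439/mydb", "user", "password")) {
                    RedshiftBatchedCommit.writeBatch(con, chunk);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}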

SSIS File Load WAY TOO SLOW in Large Destination Table

This is my first question. I've searched a lot of info from different sites but none of it was conclusive.
Problem:
Daily I'm loading a flat file with an SSIS package executed in a scheduled job in SQL Server 2005, but it's taking too much time (around 2 1/2 hours), even though the file just has around 300 rows and is approximately 50 MB. This is driving me crazy, because it is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, that's all!
-The Data Access Mode is set to FAST LOAD.
-The table has just 3 indexes, and they are nonclustered.
-My destination table has 366,964,096 records so far and 32 columns
-I haven't set FastParse on any of the output columns yet (I want to try something else first).
So I've just started to make some tests:
-Rebuilt/reorganized the indexes in the destination table (they were way too fragmented), but this didn't help much.
-Created another table with the same structure but without all the indexes and executed the job with the SSIS package loading to this new table, and it took only about 1 minute!
So I'm confused, is there something I'm missing?
-Is the SSIS package writing the whole large table to a buffer and then writing it to disk? If not, why the big difference in time?
-Are the indexes affecting the insertion time?
-Should I load the file into the new table as a temporary table and then do a BULK INSERT into the destination table with the records ordered? I thought the Data Flow Task was much faster than BULK INSERT, but at this point I don't know anymore.
Thanks in advance.
One thing I might look at is whether the large table has any triggers which are causing it to be slower on insert. Also, if the clustered index is on a field that requires a good bit of rearranging of the data during the load, that could cause an issue as well.
In SSIS packages, using a merge join (which requires sorting) can cause slowness, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't mention it.
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
Well, I rebuilt the indexes with another fill factor (80%) like Sam told me, and the time dropped significantly. It took 30 minutes instead of almost 3 hours!
I will keep testing to fine-tune the DB. Also, I didn't create a clustered index; I guess with a clustered one the time will drop a lot more.
Thanks to all, I hope this helps someone in the same situation.
