I have a complex query that takes a long time (e.g. 30 minutes) when I run it in the Snowflake console. I am running the same query from a JVM application using the JDBC driver. What appears to happen is this:
Snowflake processes the query from start to finish, taking 30 minutes.
The JVM application receives the rows; the first rows arrive 30 minutes after the query started.
What I'd like is for Snowflake to start sending rows to my application while it is still executing the query, as soon as data is ready. That way my application could start processing rows during those first 30 minutes.
Is this possible with Snowflake and JDBC?
First of all, I would suggest checking the Snowflake warehouse size and tuning it. It's not worth waiting 30 minutes when resizing the warehouse can cut the query time to a quarter or less. With either of the options below, your cost will be almost the same or lower, since query execution time tends to drop roughly in proportion to the warehouse size increase (for queries that parallelize well). Refer to the link:
Scale up by resizing a warehouse.
Scale out by adding clusters to a warehouse (requires Snowflake Enterprise Edition or higher).
Now, coming to JDBC: I believe it behaves the same way here as it does for other databases.
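I can't promise the Snowflake driver will hand your application rows before the query has finished, but a minimal JDBC sketch of the advice above - scaling the warehouse up before the heavy query and giving the driver a fetch-size hint so rows are pulled in chunks - might look like this (the account URL, credentials, warehouse name MY_WH, and table MY_LARGE_TABLE are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SnowflakeLongQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder account URL and credentials
        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://<account>.snowflakecomputing.com/", "<user>", "<password>");
             Statement stmt = conn.createStatement()) {

            // Scale the warehouse up before running the heavy query (hypothetical name/size)
            stmt.execute("ALTER WAREHOUSE MY_WH SET WAREHOUSE_SIZE = 'LARGE'");

            // Ask the driver to fetch rows in chunks rather than all at once
            stmt.setFetchSize(1000);

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM MY_LARGE_TABLE")) {
                while (rs.next()) {
                    // process each row as it is fetched
                }
            }
        }
    }
}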
I'm using the Mondrian server and the OLAP4j API in a Java web application, and I have a performance issue when adding a WHERE clause to my queries.
An MDX query like:
SELECT
CrossJoin(
{[Product.ProductHierarchie].[AllProduct]}
, {[Measures].[Quantity]}
) ON COLUMNS,
[Client.ClientHierarchie].[AllClient].Children ON ROWS
FROM [sales_data_cube]
takes 0.3 seconds to complete. But when I add a WHERE clause like
WHERE ([Period].&[start_period]:[Period].&[end_period]),
to get the sales between a start and end period, the query takes more than 250 seconds, even though the fact table is small (8,500 rows).
What should I do to get better performance?
The application is running on a Tomcat server with an 8 GB memory limit; the database server is MySQL 5.6.17.
Finally, the problem was in the configuration of Mondrian.
Mondrian uses the log4j logging package and actually calls the debug method every time it compares two objects when evaluating a WHERE condition.
The solution is to change the log4j configuration and turn off debug-level logging. I added this simple code to set up log4j before creating the OLAP connection:
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.varia.NullAppender;

// Silence log4j before creating the OLAP connection
Logger.getRootLogger().setLevel(Level.OFF);
Logger.getRootLogger().removeAllAppenders();
Logger.getRootLogger().addAppender(new NullAppender());
Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
Connection connolap = ...
More details about this issue: Mondrian Slow MDX Query.
I am connecting to a remote Oracle DB using MS Access 2010 and the ODBC for Oracle driver.
In MS Access it takes about 10 seconds to execute:
SELECT * FROM SFMFG_SACIQ_ISC_DRAWING_REVS
But it takes over 20 minutes to execute:
SELECT * INTO saciq_isc_drawing_revs FROM SFMFG_SACIQ_ISC_DRAWING_REVS
Why does it take so long to build a local table with the same data?
Is this normal?
The first statement only reads the data, and you might not be getting the full result set back in one go. The second both reads and writes the data, which will always take longer.
You haven't said how many records you're retrieving and inserting. If it's tens of thousands, then 20 minutes (roughly 1,200 seconds) seems quite good. If it's hundreds, then you may have a problem.
Have a look here https://stackoverflow.com/search?q=insert+speed+ms+access for some hints on how to improve the response, and perhaps consider changing some of the variables - e.g. using SQL Server Express instead of MS Access.
You could also do a quick speed comparison by trying to insert the records from a CSV file and/or an Excel cut-and-paste.
I am writing a proof-of-concept app which is intended to take live clickstream data at the rate of around 1000 messages per second and write it to Amazon Redshift.
I am struggling to get anything like the performance some others claim (for example, here).
I am running a cluster with 2 x dw.hs1.xlarge nodes (+ leader), and the machine doing the load is an EC2 m1.xlarge instance running 64-bit Ubuntu 12.04.1 in the same VPC as the Redshift cluster.
I am using Java 1.7 (openjdk-7-jdk from the Ubuntu repos) and the PostgreSQL 9.2-1002 driver (principally because it's the only one in Maven Central, which makes my build easier!).
I've tried all the techniques shown here, except the last one.
I cannot use COPY FROM because we want to load data in "real time", so staging it via S3 or DynamoDB isn't really an option, and Redshift doesn't support COPY FROM stdin for some reason.
Here is an excerpt from my logs showing that individual rows are being inserted at the rate of around 15/second:
2013-05-10 15:05:06,937 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 170
2013-05-10 15:05:18,707 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:05:18,708 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 712
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Beginning batch of 167
2013-05-10 15:06:14,381 [pool-1-thread-2] INFO uk.co...redshift.DatabaseWriter - Done
What am I doing wrong? What other approaches could I take?
Redshift (aka ParAccel) is an analytic database. The goal is to enable analytic queries to be answered quickly over very large volumes of data. To that end, Redshift stores data in a columnar format. Each column is held separately and compressed against the previous values in the column. This compression tends to be very effective because a given column usually holds repetitive, similar data.
This storage approach provides many benefits at query time, because only the requested columns need to be read and the data being read is highly compressed. However, the cost is that inserts tend to be slower and require much more effort. Also, inserts that are not perfectly ordered may result in poor query performance until the tables are VACUUMed.
So, by inserting a single row at a time you are working completely against the way Redshift works. The database has to append your data to each column in succession and recalculate the compression. It's a little bit (but not exactly) like adding a single value to a large number of zip archives. Additionally, even after your data is inserted you still won't get optimal performance until you run VACUUM to reorganise the tables.
If you want to analyse your data in "real time" then, for all practical purposes, you should probably choose another database and/or approach. Off the top of my head here are 3:
Accept a "small" batching window (5-15 minutes) and plan to run VACUUM at least daily.
Choose an analytic database (more $) which copes with small inserts, e.g., Vertica.
Experiment with "NoSQL" DBs that allow single path analysis, e.g., Acunu Cassandra.
The reason single inserts are slow is the way Redshift handles commits: Redshift has a single commit queue.
Say you insert row 1 and then commit - it goes to the Redshift commit queue to finish the commit.
Then you insert row 2 and commit again - it also goes to the commit queue. If the commit of row 1 has not finished by then, row 2 waits for it to complete before its own commit is processed.
So if you batch your inserts, you pay for a single commit, which is much faster than issuing a commit per row against Redshift.
You can find details on the commit queue under Tip #9: Maintaining efficient data loads in the link below.
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
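To make the "batch your inserts, single commit" advice concrete, here is a minimal JDBC sketch (the clicks table and its columns are hypothetical). Note that batching alone only reduces the number of commits; each INSERT may still be sent to Redshift individually unless you also pack multiple value tuples into one statement, as a later answer describes:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchedCommit {
    public static void insertBatch(Connection conn, List<String[]> rows) throws Exception {
        conn.setAutoCommit(false); // one commit for the whole batch instead of one per row
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO clicks (user_id, url) VALUES (?, ?)")) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        conn.commit(); // single trip through the Redshift commit queue
    }
}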
We have been able to insert 1,000 rows/sec into Redshift by batching several rows together in the same INSERT statement (in our case we had to batch ~200 value tuples in each INSERT). If you use an ORM layer like Hibernate, you can configure it for batching (e.g. see http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html).
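For reference, a multi-tuple INSERT of this kind can be built up dynamically with a prepared statement; a rough sketch (the clicks table and its columns are again hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class MultiRowInsert {
    public static void insert(Connection conn, List<String[]> rows) throws Exception {
        // Build "INSERT INTO clicks (user_id, url) VALUES (?, ?), (?, ?), ..." for all rows
        StringBuilder sql = new StringBuilder("INSERT INTO clicks (user_id, url) VALUES ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            int p = 1;
            for (String[] row : rows) {
                ps.setString(p++, row[0]);
                ps.setString(p++, row[1]);
            }
            ps.executeUpdate();
        }
    }
}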
I've been able to achieve 2,400 inserts/second by batching writes into transactions of 75,000 records per transaction. Each record is small, as you might expect, at only about 300 bytes.
I'm querying a MariaDB database installed on an EC2 instance and inserting the records into Redshift from that same EC2 instance.
UPDATE
I modified the way I was doing writes so that it loads the data from MariaDB in 5 parallel threads and writes to Redshift from each thread. That increased performance to 12,000+ writes/second.
So yes, if you plan it correctly you can get great write performance from Redshift.
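A rough sketch of that kind of parallel loading, assuming each worker gets its own JDBC connection and writes one batch per transaction (connection details are placeholders, and BatchedCommit refers to the batched-insert sketch above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelLoader {
    public static void load(String jdbcUrl, String user, String password,
                            List<List<String[]>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads); // e.g. 5 threads
        for (List<String[]> batch : batches) {
            pool.submit(() -> {
                // each task gets its own connection and commits its batch in one transaction
                try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
                    BatchedCommit.insertBatch(conn, batch); // reuse the batched-insert sketch above
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}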
Setup:
Entity Framework 4 with lazy loading enabled (model-first, table-per-hierarchy).
The number of tables is about 40 (and no table has more than 15-20 fields).
SQL Server Express 2008 (not r2).
No database triggers or any other stuff like this exist - it is only used for storage. All the logic is in the code.
Database size at the moment is approx. 2 GB.
(Primary keys are Guids and are generated in code via Guid.NewGuid() - if this matters)
Saving a complex operation result (which produces a complex object graph) takes anywhere from 40 to 60 seconds (the number returned by SaveChanges is approx. 8000 - mostly added objects and some modified ones).
Saving the same operation result against an empty (or almost empty) database usually takes around 1 second on the same computer.
The only variable that seems to affect this issue is the database size. Please note that I am only measuring the Context.SaveChanges() call (so even if I have some sluggish queries elsewhere, they should not affect this measurement).
Any suggestions as to why this operation may last this long are appreciated.
Update 1
Just to clarify - the code that takes 40-60 seconds to execute (it takes this long only when the DB size is around 2 GB) is:
Stopwatch sw = Stopwatch.StartNew();
int count = objectContext.SaveChanges(); // this method is not overridden
Debug.Write(sw.ElapsedMilliseconds); // prints out 40000 - 60000 ms
Debug.Write(count); // I am testing with exactly the same operation and the
                    // result always gives the same count for it (8460)
The same operation with an empty DB takes around 1,000 ms (while still giving the same count - 8460). So the question is: how could the database size affect SaveChanges()?
Update 2
Running a performance profiler shows that the main bottleneck (from a code perspective) is the following method:
Method: static SNINativeMethodWrapper.SNIReadSync
Called: 3251 times
Avg: 10.56 ms
Max: 264.25 ms
Min: 0.01 ms
Total: 34338.51 ms
Update 3
There are non-clustered indexes on all PKs and FKs in the database. We are using random (not sequential) Guids as surrogate keys, so fragmentation is always very high. I tried executing the operation in question right after rebuilding all DB indexes (fragmentation was below 2-3% for all indexes), but it did not seem to improve the situation in any way.
In addition, during the operation in question one of the tables involved has approximately 4 million rows (this table gets lots of inserts). SQL Profiler shows that inserts into that table can take anywhere from 1 to 200 ms (the 200 ms being a spike). Again, this does not seem to change when the indexes are freshly rebuilt.
In any case, it seems (at the moment) that the problem is on the SQL Server side, since the main thing taking up time is that SNIReadSync method. Correct me if I am being completely ignorant.
It's hard to guess without a profiler, but 8000 records is definitely too many. Usually EF 4 works fine with up to a couple of hundred objects. I would not be surprised if it turns out that change tracking takes most of this time. EF 5 and 6 have some performance optimizations, so if you cannot decrease the number of tracked objects somehow, you could experiment with them.