Hello, I am using Azure Data Factory to ingest data from Oracle into Azure SQL DB; the data is extracted in CSV format. The problem is that at the source I am reading about 10 GB of data, but once it is loaded into Azure SQL DB the data size becomes 40 GB.
Is that normal? And is there a way to reduce the size at the destination?
Thank you
Try setting the table compression level. If the table has many numeric columns, SQL Server's ROW compression stores numbers in a variable-length format similar to Oracle's default NUMBER storage, so it should produce a storage size close to what you see in Oracle.
PAGE compression adds further strategies (prefix and dictionary compression) and may give you roughly 3x compression.
And if the table has many millions of rows, use a clustered columnstore table, which may provide roughly 10x compression.
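As a rough sketch (the table and index names here are hypothetical), enabling these options in Azure SQL DB looks like this:

-- Rebuild an existing table with PAGE compression
ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = PAGE);

-- For tables with many millions of rows, a clustered columnstore index usually compresses best
CREATE CLUSTERED COLUMNSTORE INDEX cci_MyTable ON dbo.MyTable;

You can check the effect afterwards with sp_spaceused 'dbo.MyTable' and compare it to the size before the rebuild.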
I am evaluating Snowflake for a reporting use case and am considering Snowpipe for ETL. Data is ingested from S3. The data in S3 contains information about user sessions captured at regular intervals. In Snowflake, I want to store this data in aggregated form. As per the documentation, Snowflake supports only basic transformations and doesn't support GROUP BY and JOIN while copying data from the S3 stage into Snowflake tables.
I am new to ETL and Snowflake. One way I was thinking of is to load the raw, detailed data from the stage into a temporary table in Snowflake, then run aggregations (GROUP BY and JOIN) on the temporary table to load data into the final fact tables. Is this the correct approach for implementing complex transformations?
Temporary tables in Snowflake only stick around for the session that they have been created in. This means that you won't be able to point a Snowpipe to it.
Instead of a temporary table, have Snowpipe load into a transient table that stores the raw data, and then truncate the table after some period of time. This will reduce costs. Personally, I'd keep the data in the transient table for as long as possible, provided that it is not too cost prohibitive, to account for potentially late-arriving data.
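For example (the table name and columns below are hypothetical), the raw landing table can be created as a transient table so it skips Fail-safe and keeps storage costs down:

-- Hypothetical raw landing table for Snowpipe; transient tables have no Fail-safe period
CREATE TRANSIENT TABLE raw_user_sessions (
    session_id  STRING,
    user_id     STRING,
    captured_at TIMESTAMP_NTZ,
    payload     VARIANT
) DATA_RETENTION_TIME_IN_DAYS = 1;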
Yes, your approach looks good to me.
Snowpipe loads your data continuously from S3 into Snowflake, and within Snowflake you can use views, tables, and stored procedures to transform the data and load it into your final fact tables.
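A minimal sketch of that second step, assuming hypothetical table and column names: once the raw rows have landed, a scheduled task or stored procedure can aggregate them into the fact table with ordinary SQL.

-- Aggregate raw session rows into a daily fact table (names are illustrative)
INSERT INTO fact_daily_sessions (user_id, session_date, session_count, total_duration_sec)
SELECT
    user_id,
    DATE_TRUNC('day', captured_at)    AS session_date,
    COUNT(DISTINCT session_id)        AS session_count,
    SUM(payload:duration_sec::NUMBER) AS total_duration_sec
FROM raw_user_sessions
GROUP BY user_id, DATE_TRUNC('day', captured_at);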
How do I transfer a huge amount of data (nearly 10 TB) from an Oracle DB to Snowflake in a matter of hours? I see some options like Hevo and Fivetran, which are paid. However, I need the data to be moved fast so that I don't have to keep the production system down.
The fastest way to get data into Snowflake is in 10MB to 100MB chunk files. Then, you can leverage a big warehouse to COPY INTO all of the chunk files at one time. I can't speak to how to get the data out of Oracle DB quickly to S3/Azure Blob, though, especially while the system is running its normal workload.
I recommend you look at this document from Snowflake for reference on the Snowflake side: https://docs.snowflake.net/manuals/user-guide/data-load-considerations-prepare.htm
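As a sketch of the Snowflake side, assuming the chunk files have already been uploaded to a hypothetical external stage called @oracle_extract (the warehouse and table names are also made up), the parallel load is a single COPY INTO run on a large warehouse:

-- A bigger warehouse loads more chunk files concurrently
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Load all staged chunk files in one statement
COPY INTO my_big_table
FROM @oracle_extract
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
ON_ERROR = 'ABORT_STATEMENT';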
Is there a network speed issue?
In any case, the data should be compressed when it is transferred over the network.
There are three locations involved in the staging:
Oracle database,
the extraction client,
and the cloud storage.
You have two data transfers:
between database and client,
and between client and cloud storage.
If the Oracle version is 12cR2 or newer, the DB client can compress data when getting it out of the database. The data should then be compressed again and transferred to cloud storage at your Snowflake destination.
The final step is to load the data from cloud storage into Snowflake (within the same data center)...
Ideally you shouldn't need to keep the production database down. You should be able to categorise the data into
1 - historical data that will not change. You can extract this data at your leisure; it should not require the database to be down.
2 - static data that is fairly stable. You can also extract this data at your leisure.
You only need to keep your database fairly stable (not down) while you are extracting the rest of the data. This will require you to build some way to track and validate all of your datasets. There is no reason why you couldn't let users continue to read from the database while you are performing the extract from Oracle.
We have transaction tables in Oracle, and for reporting purposes we need this data transferred in real time to a flat Oracle table in another database. Report performance is great when reading from this flat table.
Currently we are using GoldenGate to replicate to the other database and a materialized view to populate the flat table, but due to some problems we need to switch to another way of populating/maintaining it. What options do we have?
It is a pretty basic requirement, but the solutions I can find are geared towards batch processing. Please also mention any other solutions you feel would better serve this purpose; changing the target database to something else is also an option, as there might be more such reports coming ahead.
We have a TDBGrid connected to a TClientDataSet via a TDataSetProvider in Delphi 7 with an Oracle database.
It works fine for showing the contents of small tables, but the program hangs when you try to open a table with many rows (for example 2 million rows), because TClientDataSet tries to load the whole table into memory.
I tried setting "FetchOnDemand" to True on our TClientDataSet and "poFetchDetailsOnDemand" to True in the Options of the TDataSetProvider, but it does not solve the problem. Any ideas?
Update:
My solution is:
TClientDataSet.FetchOnDemand = True
TDataSetProvider.Options.poFetchDetailsOnDemand = True
TClientDataSet.PacketRecords = 500
I succeeded in solving the problem by setting the "PacketRecords" property of TCustomClientDataSet. This property indicates the number or type of records in a single data packet. PacketRecords defaults to -1, meaning that a single packet should contain all records in the dataset, but I changed it to 500 rows.
When working with an RDBMS, and especially with large datasets, trying to access a whole table is exactly what you shouldn't do. That's a typical newbie mistake, or a habit carried over from old file-based desktop database engines.
When working with an RDBMS, you should load only the rows you're interested in, display/modify/update/insert them, and send the changes back to the database. That means a SELECT with a proper WHERE clause and also an ORDER BY; remember, row ordering is never guaranteed when you issue a SELECT without an ORDER BY, as a database engine is free to return rows in whatever order it sees fit for a given query.
If you have to perform bulk changes, you need to do them in SQL and have them processed on the server, not load a whole table client side, modify it, and send changes row by row to the database.
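For illustration (the table and column names are made up), the pattern looks roughly like this: fetch only the slice you need, in a guaranteed order, and push set-based changes to the server.

-- Fetch only the rows of interest, in a guaranteed order
SELECT order_id, customer_id, order_date, status
FROM   orders
WHERE  order_date >= DATE '2024-01-01'
  AND  status = 'OPEN'
ORDER BY order_date, order_id;

-- Bulk change done server side instead of row by row from the client
UPDATE orders
SET    status = 'ARCHIVED'
WHERE  order_date < DATE '2015-01-01';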
Loading large datasets client side may fail for several reasons: lack of memory (especially in 32-bit applications), memory fragmentation, and so on. You will probably flood the network with data you don't need, force the database to perform a full scan, and maybe flood the database cache as well.
That's why client datasets are not designed to handle millions or billions of rows. They are designed to cache the rows you need client side and then apply changes back to the remote data. You need to change your application logic.
We're in the process of converting our database from Sybase to Oracle and we've hit a performance problem. In Sybase, we had a TEXT field and replaced it with a CLOB in Oracle.
This is how we accessed the data in our Java code:
while (rs.next()) {
    String clobValue = rs.getString(1); // This takes 176ms in Oracle!
    .
    .
}
The database is across the country, but still, we didn't have any performance problems with Sybase and its retrieval of TEXT data.
Is there something we can do to increase this performance?
By default, LOBs are not fetched along with the table data and it takes an extra round-trip to the database to fetch them in getString.
If you are using Oracle's .NET provider, you may set InitialLOBFetchSize in the data reader settings to a value large enough to accommodate your large objects in memory so they could be fetched in all their entirety along with the other data.
Some other options:
Are the LOB columns being stored in-line (in the data row) or out-of-line (in a separate segment)? If the LOB values tend to be small (under about 4 KB), you can use the ENABLE STORAGE IN ROW clause to tell Oracle to store the data in-line where possible.
If your LOBs are larger and frequently read, are they being kept in the buffer cache? The default in 10g is that LOBs are NOCACHE, meaning each I/O operation against them involves a direct read against the database files, a synchronous disk event, which can be slow. A database trace would reveal significant waits on direct path read / direct path write events.
This chapter of the Oracle Application Developer's Guide - Large Objects would be valuable reading.
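A hedged sketch of those two options (table and column names are hypothetical): the LOB storage clause lets you request in-row storage and caching, either when the table is created or by moving an existing LOB column.

-- Store CLOB values up to ~4000 bytes in the row itself and keep LOB blocks in the buffer cache
CREATE TABLE documents (
    doc_id NUMBER PRIMARY KEY,
    body   CLOB
)
LOB (body) STORE AS (
    ENABLE STORAGE IN ROW
    CACHE
);

-- An existing column can be changed with ALTER TABLE ... MOVE LOB
ALTER TABLE documents MOVE LOB (body) STORE AS (ENABLE STORAGE IN ROW CACHE);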
We decided to take a different approach which lets us sidestep CLOB performance altogether.
Our current code (I didn't write it!) queries a table in the database and retrieves all of the information in it, including the CLOBs, even though it wasn't really necessary to retrieve them all of the time. Instead, we created another column holding the first 4k characters in a VARCHAR and query that instead. Then, when we need the full CLOB, we query it on an individual basis, rather than fetching all CLOBs for all records.
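A rough sketch of that workaround in Oracle SQL, with hypothetical names: DBMS_LOB.SUBSTR copies the leading characters of the CLOB into a plain VARCHAR2 column that can be fetched without the extra LOB round trip.

-- Add a preview column and populate it with the first 4000 characters of the CLOB
-- (assumes a single-byte character set so the result fits in VARCHAR2(4000))
ALTER TABLE documents ADD (body_preview VARCHAR2(4000));

UPDATE documents
SET    body_preview = DBMS_LOB.SUBSTR(body, 4000, 1);

-- Reports read the cheap preview column; the full CLOB is fetched per document only when needed
SELECT doc_id, body_preview FROM documents;
SELECT body FROM documents WHERE doc_id = :id;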