I have a table with ~800k records and ~100 fields.
The table has an ID field which is a unique NVARCHAR(18) type.
The table also has a field called LastModifiedDate which holds the date of the latest change made to each record.
I’m trying to perform an incremental load based on the following:
Initial load of all data (happens once)
Loading, based on LastModifiedDate, only recent changed/added records (~30k)
Based on the key field (ID), performing INSERT/UPDATE on recent data to the existing data
(*) assuming records are not deleted
I’m trying to achieve this by doing the following steps:
Truncate the temp table (which holds the recent data)
Extracting the recent data and storing it in the temp table
Extracting the data from the temp table
Using Lookup with the following definitions:
a. Cache mode = Full Cache
b. Connection Type = OLE DB connection manager
c. No matching entries = Ignore failure
Selecting ID from the final table, linking it to the ID field of the temp table, and giving the new field an output alias of LKP_ID
Using a Conditional Split and checking ISNULL(LKP_ID): true means INSERT and false means UPDATE
INSERT means that the data from the temp table will be inserted into the final table, and UPDATE means that an SQL UPDATE statement will be executed based on the temp table data
The final result is correct, BUT the run time is terrible: it takes ~30 minutes or so to complete.
The way I would handle this is to use the LastModifiedDate in your source query to get the records from the source table that have changed since the last import.
Then I would import all of those records into an empty staging table on the destination database server.
Then I would execute a stored procedure to do the INSERT/UPDATE of the final destination table from the data in the staging table. A stored procedure on the destination server will perform MUCH faster than using Lookups and Conditional Splits in SSIS.
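As a rough illustration, the stored procedure could look something like this (a minimal sketch; the table and column names other than ID and LastModifiedDate are placeholders for your actual schema):

-- Hypothetical names: dbo.StagingTable holds the ~30k recent rows, dbo.FinalTable is the destination.
CREATE PROCEDURE dbo.UpsertFromStaging
AS
BEGIN
    SET NOCOUNT ON;

    -- Update rows that already exist in the final table.
    UPDATE f
    SET    f.Col1             = s.Col1,
           f.Col2             = s.Col2,
           f.LastModifiedDate = s.LastModifiedDate
    FROM   dbo.FinalTable AS f
    INNER JOIN dbo.StagingTable AS s
            ON s.ID = f.ID;

    -- Insert rows that do not exist yet.
    INSERT INTO dbo.FinalTable (ID, Col1, Col2, LastModifiedDate)
    SELECT s.ID, s.Col1, s.Col2, s.LastModifiedDate
    FROM   dbo.StagingTable AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.FinalTable AS f WHERE f.ID = s.ID);
END;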
I'm currently trying to create txt files from all the tables in the dbo schema.
I have around 200-300 tables there, so it would take too much time to create them manually.
I was thinking of creating a loop.
So, as an example (using AdventureWorks2019):
select t.name as table_name
from sys.tables t
where schema_name(t.schema_id) = 'Person'
order by table_name;
This would get all the table names within the Person schema.
So I would loop :
Table input : select * from ${table_name}
But then I realized that for txt files I need to declare all the fields and their data types in Pentaho, so that becomes a problem.
Any ideas on how to create these "backup" txt files?
Use Metadata Injection together with further queries to the schema catalog tables in SQL Server. You not only need to retrieve the table name; you also need to retrieve the columns in that table and their data types, and inject that information (metadata) into the text output step.
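For the column metadata, a query along these lines against the SQL Server catalog views could feed the injection (just a sketch; which fields you actually inject depends on your text output step):

-- Column names and data types for every table in a given schema
SELECT  t.name       AS table_name,
        c.name       AS column_name,
        ty.name      AS data_type,
        c.max_length AS max_length,
        c.column_id  AS column_position
FROM    sys.tables  t
JOIN    sys.columns c  ON c.object_id = t.object_id
JOIN    sys.types   ty ON ty.user_type_id = c.user_type_id
WHERE   schema_name(t.schema_id) = 'Person'
ORDER BY t.name, c.column_id;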
In the samples directory of your Spoon installation you have an example of how to use Metadata Injection. Use it, along with the documentation, to build a simple example (the option to generate a transformation with the injected metadata is of great use for debugging).
I have something similar to copy data from one database to another, both in Oracle, but SQL Server has catalog tables similar to Oracle's for retrieving the information you need. I created a simple, almost empty transformation that reads one table and writes to another. This transformation has almost no information in it, only the origin database in the Table Input step and the target database in the Table Output step.
Then I have a second transformation where I fill in all the information (metadata) to inject: the query to perform in the Table Input step, and all the data I need in the Table Output step: the target table, whether to truncate before inserting, and the column mapping from (stream field) to (table field).
I'm using the Data Skipping Indexes feature in ClickHouse and I'm confused about its usage. If I add a data skipping index when I create the table, like this:
CREATE TABLE MyTable
(
...
INDEX index_time TimeStamp TYPE minmax GRANULARITY 1
)
ENGINE =MergeTree()
...
When I query with a TimeStamp filter condition, 'index_time' works. But if I didn't add the index when creating the table and instead added it afterwards with the Manipulations With Data Skipping Indices feature, like this:
ALTER TABLE MyTable ADD INDEX index_time TimeStamp TYPE minmax GRANULARITY 1
then the index 'index_time' doesn't work.
My database is running in production so I can't recreate the table; I have to use the second way. Can anyone explain why it does not work, or whether I used the feature in the wrong way?
The reason your queries don't use the index after an ALTER TABLE ADD INDEX is because the index does not exist yet. (!)
Any new data will be properly indexed, which is why your index works when you put it in CREATE TABLE. ClickHouse builds the index as you load data. If you created the table, ran ALTER TABLE ADD INDEX, and loaded data you would see the same behavior.
When the data already exist, things are different. ALTER TABLE updates the metadata for the table, but at this point all your data have been written to parts in the table. ClickHouse does not rewrite parts automatically to implement new indexes. However, you should be able to force rewriting to include the index by running:
OPTIMIZE TABLE MyTable FINAL
See the Github issue https://github.com/yandex/ClickHouse/issues/6561 referenced by Ruijang for more information.
It's quite right that
OPTIMIZE TABLE my_table_name FINAL;
does recreate the indexes set on the table. But there are some scenarios in a columnar DB where you want to avoid rewriting EVERYTHING. If you just add a single index to an already existing table with lots of data, you may want to build up only the new index, which involves two steps:
Step 1 - Define the index
Creating the INDEX itself just defines what the index should do, which is reflected in ClickHouse as metadata added to the table. There is no real index build-up yet, so nothing will be faster. It's also a lightweight operation, as it won't change data or build up any structures besides the table metadata.
It's important to understand that any new incoming data will be indexed on insert, but any pre-existing data is not included!
ALTER TABLE my_table_name ADD INDEX my_index(my_expression) TYPE minmax GRANULARITY 1
Note that ClickHouse can index expressions, so it could simply be the column name, as in the question, or a more complex expression (e.g. my_index(price * sold_items * revshare)). The index will, of course, only work on that expression.
Step 2 - Build up (materialize) the index
After creation of the metadata, the index for the existing data needs to be built up. This action is called materialize and needs to be explicitly triggered. The good thing is you can do this individually for any index that was added or changed. This is a heavy operation, as it will trigger work on the database.
ALTER TABLE my_table_name MATERIALIZE INDEX my_index;
Also have a look at the ClickHouse docs on Manipulating Data Skipping Indices.
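Applied to the table and column from the question, the whole sequence would look roughly like this (a sketch of the two steps described above):

-- Step 1: define the index (metadata only; existing parts are not touched)
ALTER TABLE MyTable ADD INDEX index_time TimeStamp TYPE minmax GRANULARITY 1;

-- Step 2: build the index for the data that already exists
ALTER TABLE MyTable MATERIALIZE INDEX index_time;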
I am trying to load tables into the Oracle in-memory database. I have enabled the tables for INMEMORY by using the SQL*Plus command ALTER TABLE table_name INMEMORY. The table is also populated with data. But when I run the command SELECT v.owner, v.segment_name name, v.populate_status status FROM v$im_segments v;, it shows no rows selected.
What can be the problem?
Have you considered this?
https://docs.oracle.com/database/121/CNCPT/memory.htm#GUID-DF723C06-62FE-4E5A-8BE0-0703695A7886
Population of the IM Column Store in Response to Queries
Setting the INMEMORY attribute on an object means that this object is a candidate for population in the IM column store, not that the database immediately populates the object in memory.
By default (INMEMORY PRIORITY is set to NONE), the database delays population of a table in the IM column store until the database considers it useful. When the INMEMORY attribute is set for an object, the database may choose not to materialize all columns when the database determines that the memory is better used elsewhere. Also, the IM column store may populate a subset of columns from a table.
You probably need to run a select against the data first.
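For instance, something along these lines (a sketch; table_name is a placeholder for your INMEMORY-enabled table) should trigger population, after which the segment should show up:

-- A full scan makes the database populate the table in the IM column store
SELECT /*+ FULL(t) */ COUNT(*) FROM table_name t;

-- Then check again
SELECT v.owner, v.segment_name, v.populate_status
FROM   v$im_segments v;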
For the source: OLE DB Source - SQL Command
SELECT -- The destination table Id has IDENTITY(1,1) so I didn't take it here
[GsmUserId]
,[GsmOperatorId]
,[SenderHeader]
,[SenderNo]
,[SendDate]
,[ErrorCodeId]
,[OriginalMessageId]
,[OutgoingSmsId]
,24 AS [MigrateTypeId] --This is a static value
FROM [MyDb].[migrate].[MySource] WITH (NOLOCK)
For the destination: OLE DB Destination
It takes 5 or more minutes to insert 1,000,000 rows. I even unchecked Check Constraints.
Then, with the same SSIS configuration, I wanted to test it against another table exactly like the destination table. So I re-created the destination table (with the same constraints but without the existing data) and named it dbo.MyDestination.
But it takes about 30 seconds or less to load the SAME data, with the same number of rows.
Why is it significantly faster with the test table and not the original table? Is it because the original table already has 107,000,000 rows?
Check for indexes/triggers/constraints etc. on your destination table. These may slow things down considerably.
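For example, a quick way to check is to query the SQL Server catalog views (a sketch; dbo.OriginalDestination is a placeholder for your real destination table name):

-- Indexes on the destination table
SELECT name, type_desc
FROM   sys.indexes
WHERE  object_id = OBJECT_ID('dbo.OriginalDestination');

-- Triggers on the destination table
SELECT name, is_disabled
FROM   sys.triggers
WHERE  parent_id = OBJECT_ID('dbo.OriginalDestination');

-- Foreign key and check constraints on the destination table
SELECT name, type_desc
FROM   sys.objects
WHERE  parent_object_id = OBJECT_ID('dbo.OriginalDestination')
  AND  type IN ('F', 'C');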
Check the OLE DB connection manager's Packet Size and set it appropriately; you can follow this article to set it to the right value.
If you are familiar with SQL Server Profiler, use it to get more insight, especially into what happens when you insert data into the re-created table versus the original table.
Here's the deal; the issue isn't with getting the CSV into SQL Server, it's getting it to work how I want it... which I guess is always the issue :)
I have a CSV file with columns like DATE, TIME, BARCODE, etc. I use a Derived Column transformation to concatenate the DATE and TIME into a DATETIME for my import into SQL Server, and I import all data into the database. The issue is that we only get a new .CSV file every 12 hours, and for example's sake we will say the .CSV is updated four times a minute.
Given that we will run the job every 15 minutes, we will get a ton of overlapping data. I imagine I will use a variable, say LastCollectedTime, which can be pulled from my SQL database using MAX(READTIME). My problem is that I only want to collect rows with a ReadTime more recent than that variable.
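In other words, something like this to pull the watermark (just a sketch; dbo.DestinationTable is a placeholder for the actual table):

-- Pulled into the SSIS variable LastCollectedTime (e.g. via an Execute SQL Task),
-- then used to filter out rows whose ReadTime is not newer than this value.
SELECT MAX(ReadTime) AS LastCollectedTime
FROM   dbo.DestinationTable;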
Destination table structure:
ID, ReadTime, SubID, ...datacolumns..., LastModifiedTime where LastModifiedTime has a default value of GETDATE() on the last insert.
Any ideas? Remember, our readtime is a Derived Column, not sure if it matters or not.
Here is one approach that you can make use of:
Let's assume that your destination table in SQL Server is named BarcodeData.
Create a staging table (say BarcodeStaging) in your database, with the same column structure as your destination table BarcodeData, into which the CSV data will be imported.
In the SSIS package, add an Execute SQL Task before the Data Flow Task to truncate the staging table BarcodeStaging.
Import the CSV data into the staging table BarcodeStaging and not into the actual destination table.
Use the MERGE statement (I assume that you are using SQL Server 2008 or a higher version) to compare the staging table BarcodeStaging and the actual destination table BarcodeData, using the DateTime column as the join key. If there are unmatched rows, copy them from the staging table and insert them into the destination table (see the sketch after the link below).
Technet link to MERGE statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
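A minimal sketch of that MERGE, assuming the derived DATETIME column is named ReadTime and treating the remaining column names as placeholders:

-- Insert only the staged rows that don't already exist in the destination.
MERGE dbo.BarcodeData AS target
USING dbo.BarcodeStaging AS source
    ON target.ReadTime = source.ReadTime
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ReadTime, Barcode, SubID)          -- placeholder column list
    VALUES (source.ReadTime, source.Barcode, source.SubID);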
Hope that helps.