Import most recent data from CSV to SQL Server with SSIS - business-intelligence

Here's the deal: the issue isn't getting the CSV into SQL Server, it's getting it to work the way I want... which I guess is always the issue :)
I have a CSV file with columns like: DATE, TIME, BARCODE, etc. I use a Derived Column transformation to concatenate DATE and TIME into a DATETIME for my import into SQL Server, and I import all the data into the database. The issue is that we only get a new .CSV file every 12 hours, and for the sake of example we will say the .CSV is updated four times a minute.
If we run the job every 15 minutes, we will get a ton of overlapping data. I imagine I will use a variable, say LastCollectedTime, which can be pulled from my SQL database using MAX(ReadTime). My problem is that I only want to collect rows with a ReadTime more recent than that variable.
Destination table structure:
ID, ReadTime, SubID, ...datacolumns..., LastModifiedTime where LastModifiedTime has a default value of GETDATE() on the last insert.
Any ideas? Remember, our readtime is a Derived Column, not sure if it matters or not.

Here is one approach that you can make use of:
Let's assume that your destination table in SQL Server is named BarcodeData.
Create a staging table (say BarcodeStaging) in your database that has the same column structure as your destination table BarcodeData. The CSV data is imported into this staging table.
In the SSIS package, add an Execute SQL Task before the Data Flow Task to truncate the staging table BarcodeStaging.
Import the CSV data into the staging table BarcodeStaging and not into the actual destination table.
Use the MERGE statement (I assume that you are using SQL Server 2008 or a later version) to compare the staging table BarcodeStaging with the actual destination table BarcodeData, using the DateTime column (ReadTime in your table) as the join key. Rows in the staging table that have no match in the destination table are inserted into the destination table.
Technet link to MERGE statement: http://technet.microsoft.com/en-us/library/bb510625.aspx
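The staging steps above can be sketched outside SSIS to show the logic. This is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of SQL Server, so INSERT ... WHERE NOT EXISTS stands in for MERGE and DELETE stands in for TRUNCATE TABLE. The two-column table layout and the load_batch helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BarcodeData    (ReadTime TEXT, Barcode TEXT);
    CREATE TABLE BarcodeStaging (ReadTime TEXT, Barcode TEXT);
""")


def load_batch(conn, csv_rows):
    """Stage one CSV load, then copy only the rows the destination lacks.

    Returns the number of rows added to BarcodeData.
    """
    before = conn.execute("SELECT COUNT(*) FROM BarcodeData").fetchone()[0]
    conn.execute("DELETE FROM BarcodeStaging")  # stands in for TRUNCATE TABLE
    conn.executemany("INSERT INTO BarcodeStaging VALUES (?, ?)", csv_rows)
    # SQLite has no MERGE; this is the "insert when not matched" branch only.
    conn.execute("""
        INSERT INTO BarcodeData (ReadTime, Barcode)
        SELECT s.ReadTime, s.Barcode
        FROM BarcodeStaging AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM BarcodeData AS d WHERE d.ReadTime = s.ReadTime
        )
    """)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM BarcodeData").fetchone()[0]
    return after - before


# The second load repeats the first two rows and adds one new row.
first = load_batch(conn, [("2012-05-01 10:00:00", "A1"),
                          ("2012-05-01 10:00:15", "B2")])
second = load_batch(conn, [("2012-05-01 10:00:00", "A1"),
                           ("2012-05-01 10:00:15", "B2"),
                           ("2012-05-01 10:00:30", "C3")])
total = conn.execute("SELECT COUNT(*) FROM BarcodeData").fetchone()[0]
```

Matching on ReadTime alone follows the suggestion above. If several barcodes can share the same timestamp, the match condition would probably need to include the barcode column as well.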
Hope that helps.

Related

Creating txt file using Pentaho

I'm currently trying to create txt files from all the tables in the dbo schema.
I have something like 200-300 tables there, so it would take too much time to create them manually.
I was thinking of creating a loop.
so as example (using AdventureWorks2019) :
select t.name as table_name
from sys.tables t
where schema_name(t.schema_id) = 'Person'
order by table_name;
This would get all the table names within the Person schema.
So I would loop over:
Table input : select * from ${table_name}
But then I realized that for txt files I need to declare all the fields and their data types in Pentaho, so that becomes a problem.
Any ideas on how to produce these "backup" txt files?
Use Metadata Injection, together with more queries against the schema catalog tables in SQL Server. Retrieving the table name is not enough: you also need to retrieve the columns in that table and their data types, and inject that information (metadata) into the text output step.
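To illustrate the idea of discovering column metadata at run time instead of declaring it by hand, here is a minimal sketch. It is not Pentaho Metadata Injection: it swaps in plain Python with the built-in sqlite3 module, where sqlite_master plays the role of sys.tables. The export_tables helper, the sample tables, and the tab-separated layout are all hypothetical.

```python
import csv
import os
import sqlite3
import tempfile


def export_tables(conn, out_dir):
    """Write every table to <out_dir>/<table>.txt and return {table: columns}."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    exported = {}
    for table in tables:
        # Table names come from the catalog; quote them as identifiers anyway.
        cur = conn.execute('SELECT * FROM "{}"'.format(table.replace('"', '""')))
        # Column names are read from the cursor, so nothing is declared by hand.
        columns = [col[0] for col in cur.description]
        path = os.path.join(out_dir, table + ".txt")
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
            writer.writerow(columns)
            writer.writerows(cur)
        exported[table] = columns
    return exported


# Hypothetical sample data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person  (Id INTEGER, Name TEXT);
    CREATE TABLE Address (Id INTEGER, City TEXT);
    INSERT INTO Person  VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Address VALUES (1, 'Oslo');
""")
out_dir = tempfile.mkdtemp()
exported = export_tables(conn, out_dir)
```

In Pentaho the same two pieces of information (table list, then per-table columns and types) would be fed to the template transformation through the injection step rather than handled in a script.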
The samples directory of your Spoon installation contains an example of how to use Metadata Injection. Use it, along with the documentation, to build a simple example (the option to generate a transformation with the metadata you have injected is very useful for debugging).
I have something similar to copy data from one database to another, both in Oracle, but SQL Server has catalog tables similar to Oracle's from which you can retrieve the information you need. I created a simple, almost empty transformation to read one table and write to another. This transformation has almost no information, only the source database in the Table Input step and the target database in the Table Output step.
Then I have a second transformation where I fill in all the information (metadata) to inject: the query to run in the Table Input step, and all the data I need in the Table Output step, namely the target table, whether to truncate before inserting, and the column mapping from (stream field) and to (table field).

Retrieve from dynamically named tables

I'm using Talend Open Studio for Data Integration.
I have tables that are generated every day and table names are suffixed by date like so
dailystats20220127
dailystats20220126
dailystats20220125
dailystats20220124
I have a two-part question.
I want to look at the table that has yesterday's date in its name (so sysdate - 1) and fetch yesterday's data from it.
select 'dailystats' ||to_char(sysdate - 1,'YYYYMMDD') TableName
from dual;
How do I retrieve the schema for a dynamic table name?
How do I pull data from that table?
I've worked with static table names and it's a straightforward process.
If the schema is always the same, you just define it once in your input component.
In your input component, set the SQL to:
"select [fields] from dailystats"+ TalendDate.formatDate("yyyyMMdd", TalendDate.addDate(TalendDate.getCurrentDate(), -1, "dd"))
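The same name construction, written as a small Python sketch so the date arithmetic is easy to check. The daily_table_name helper and the select * query are hypothetical; only the dailystats prefix and the yyyyMMdd suffix come from the question.

```python
from datetime import date, timedelta


def daily_table_name(today, prefix="dailystats"):
    """Return the name of the table generated for the day before `today`."""
    yesterday = today - timedelta(days=1)
    return prefix + yesterday.strftime("%Y%m%d")


# Build the query text for whatever "yesterday" is at run time.
query = "select * from " + daily_table_name(date.today())
```

Subtracting a timedelta handles month and year boundaries, which is the same job TalendDate.addDate does in the expression above.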

Hive not creating separate directories for skewed table

My Hive version is 1.2.1. I am trying to create a skewed table, but it doesn't seem to be working. Here is my table creation script:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable
(
country string,
payload string
)
PARTITIONED BY (year int,month int,day int,hour int)
SKEWED BY (country) on ('USA','Brazil') STORED AS DIRECTORIES
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE mydb.mytable PARTITION(year = 2019, month = 10, day=05, hour=18)
SELECT country,payload FROM mydb.mysource;
The select query returns names of countries and some associated string data (payload). Based on the way I have specified skewing on the column 'country', I was expecting the insert statement to create separate directories for USA and Brazil (the select query returns enough rows with country as USA and Brazil), but this didn't happen. Instead, Hive created a directory called 'HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME' and all the values went into a single file in that directory. A skewed table is only supposed to send rows with default values (those not specified in the table creation statement) to the common directory (which is what HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME seems to be) and should create dedicated directories for the rows with skew values. But instead everything goes to the default directory and the other directories aren't even created. Do I have to toggle any Hive options to make this work?
It looks like an old bug that doesn't appear to be fixed yet: https://issues.apache.org/jira/browse/HIVE-13697. Basically, when Hive stores the skew values specified during table creation, it converts them to lower case before saving them in the metastore. The workaround for now is to convert the column to lower case in the select statement so that each row goes to the right bucket. I tested this and it works this way.
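A toy model of the behaviour described above may make the workaround clearer. This is not Hive code. It only assumes what the answer states, that declared skew values are lower-cased when stored, and that rows are then matched case-sensitively. The target_dir function and the country=... directory label are hypothetical.

```python
DEFAULT_DIR = "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"


def target_dir(country, declared_skew_values):
    """Pick a directory label for one row under the assumed matching rule."""
    # Assumption from the answer: skew values are stored in lower case.
    stored = {value.lower() for value in declared_skew_values}
    # Assumption: the row value is compared as-is, without case folding.
    if country in stored:
        return "country=" + country  # hypothetical label for a dedicated directory
    return DEFAULT_DIR


declared = ["USA", "Brazil"]
as_inserted = target_dir("USA", declared)             # original select
with_workaround = target_dir("USA".lower(), declared)  # select lower(country)
```

Under these assumptions the unmodified value falls through to the default directory, while the lower-cased value matches a stored skew value.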

SSIS Incremental Load Performance

I have a table with ~800k records and with ~100 fields.
The table has an ID field, which is a unique NVARCHAR(18).
The table also has a field called LastModifiedDate, which holds the time of the latest change made to the record.
I’m trying to perform an incremental load based on the following:
Initial load of all data (happens once)
Loading, based on LastModifiedDate, only recent changed/added records (~30k)
Based on the key field (ID), performing INSERT/UPDATE on recent data to the existing data
(*) assuming records are not deleted
I’m trying to achieve this by doing the following steps:
Truncate the temp table (which holds the recent data)
Extracting the recent data and storing it in the temp table
Extracting the data from the temp table
Using Lookup with the following definitions:
a. Cache mode = Full Cache
b. Connection Type = OLE DB connection manager
c. No matching entries = Ignore failure
Selecting ID from the final table, linking it to the ID field from the temp table, and giving the new field the output alias LKP_ID
Using Conditional Split to check ISNULL(LKP_ID): true means INSERT and false means UPDATE
INSERT means that the data from the temp table is inserted into the final table, and UPDATE means that an SQL UPDATE statement is executed based on the temp table data
The final result is good, BUT the run time is terrible: it takes ~30 minutes or so to complete.
The way I would handle this is to use the LastModifiedDate in your source query to get the records from the source table that have changed since the last import.
Then I would import all of those records into an empty staging table on the destination database server.
Then I would execute a stored procedure to do the INSERT/UPDATE of the final destination table from the data in the staging table. A stored procedure on the destination server will perform MUCH faster than using Lookups and Conditional Splits in SSIS.
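The set-based INSERT/UPDATE that such a stored procedure would perform can be sketched as follows. This is a minimal sketch under stated assumptions: Python's built-in sqlite3 module stands in for SQL Server, and the table names FinalTable and StagingTable, the single Payload column, and the apply_staging helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the destination server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FinalTable   (ID TEXT PRIMARY KEY, Payload TEXT, LastModifiedDate TEXT);
    CREATE TABLE StagingTable (ID TEXT PRIMARY KEY, Payload TEXT, LastModifiedDate TEXT);
    INSERT INTO FinalTable   VALUES ('A', 'old',   '2020-01-01'),
                                    ('B', 'keep',  '2020-01-01');
    INSERT INTO StagingTable VALUES ('A', 'new',   '2020-01-02'),
                                    ('C', 'added', '2020-01-02');
""")


def apply_staging(conn):
    """Apply staged rows to the final table with two set-based statements."""
    with conn:  # one transaction: both statements commit or roll back together
        # Update rows whose ID already exists in the final table.
        conn.execute("""
            UPDATE FinalTable
            SET Payload = (SELECT s.Payload FROM StagingTable AS s
                           WHERE s.ID = FinalTable.ID),
                LastModifiedDate = (SELECT s.LastModifiedDate FROM StagingTable AS s
                                    WHERE s.ID = FinalTable.ID)
            WHERE ID IN (SELECT ID FROM StagingTable)
        """)
        # Insert rows whose ID is not in the final table yet.
        conn.execute("""
            INSERT INTO FinalTable (ID, Payload, LastModifiedDate)
            SELECT s.ID, s.Payload, s.LastModifiedDate
            FROM StagingTable AS s
            WHERE s.ID NOT IN (SELECT ID FROM FinalTable)
        """)


apply_staging(conn)
rows = conn.execute("SELECT ID, Payload FROM FinalTable ORDER BY ID").fetchall()
```

The point of the design is that each statement processes all staged rows at once, instead of issuing one UPDATE per row as the data flow version does.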

Updating Hive external table with HDFS changes

Let's say I created a Hive external table "myTable" from the file myFile.csv (located in HDFS).
myFile.csv changes every day, so I would like to update "myTable" once a day too.
Is there any HiveQL query that tells Hive to update the table every day?
Thank you.
P.S.
I would like to know if it works the same way with directories: let's say I create a Hive partition from the HDFS directory "myDir" when "myDir" contains 10 files. The next day "myDir" contains 20 files (10 files were added). Should I update the Hive partition?
There are basically two types of tables in Hive.
One is the managed table, which is managed by the Hive warehouse: whenever you create such a table, the data is copied to the internal warehouse.
So you cannot have the latest data in the query output.
The other is the external table, for which Hive does not copy the data to the internal warehouse.
So whenever you run a query on the table, it retrieves the data from the file.
That way you always have the latest data in the query output.
That is one of the goals of external tables.
You can even drop the table and the data is not lost.
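If you add a LOCATION clause to your CREATE TABLE statement that points at the HDFS directory containing myFile.csv (LOCATION takes a directory, not a file path such as '/path/to/myFile.csv'), you shouldn't have to update anything in Hive. Queries will always read the latest version of the files in that directory.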
