Hive: how to handle SCD Type 2 without UPDATE - Hadoop

Currently, in our on-prem Hadoop environment, we are using Hive tables with transactional properties. However, as we are moving to AWS, we don't have that feature yet, and so I want to understand how to handle SCD Type 2 without updates.
For example, for the following record:
With updates
In a table with transactional properties enabled, when I get an update for a record, I change the end_date of the existing record to the current date and create a new record with effective_date set to the current date and end_date set to 12/31/9999. That makes it easy to find my active record (where end_date = "12/31/9999").
However, if I can't update the past record, I end up with two records with the same end_date.
My questions are: if I can't update the end_date of the past record,
How do I get the historical duration of stay?
How do I get the active record?
Without updates

First of all, convert all dates to the 'yyyy-MM-dd' format, so they will all be sortable and analytic functions will work. Then you can use lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date). For id=1 and effective_date = '2019-01-01' it gives you '2020-08-15', and you can assign this value as the end_date for the '2019-01-01' record. If there is no record with a bigger effective_date, the default '9999-01-01' is assigned. After this transformation, the active record is the one with end_date = '9999-01-01'.
Supposing the dates are already converted to yyyy-MM-dd, this is how you can rewrite your table (after the insert):
insert overwrite table your_table
select name, id, location, effective_date,
       lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date) as end_date
from your_table;
Or, without doing the insert first, you can UNION ALL the existing records with the new records in a subquery, then calculate lead().
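A minimal sketch of that variant, assuming the new rows arrive in a staging table (here called new_records, a hypothetical name) with the same columns:
insert overwrite table your_table
select name, id, location, effective_date,
       lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date) as end_date
from
(
    select name, id, location, effective_date from your_table
    union all
    select name, id, location, effective_date from new_records
) s;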
Actually, SCD2 is not recommended in Hive, because looking up the matching dimension version requires a non-equi join, and Hive implements that as a cross join plus a filter (or as an equi-join on dim.id = fact.id, which duplicates rows, followed by where fact.date <= dim.end_date and fact.date >= dim.effective_date, which filters back down to one record). This join is very expensive if the dimension and the fact are big, because of the duplication before filtering.
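For illustration, such a lookup typically looks like this (the fact/dim table names and the txn_date column are hypothetical):
select f.*, d.location
from fact f
join dim d
  on f.id = d.id                       -- equi part: duplicates each fact row once per dimension version
where f.txn_date >= d.effective_date   -- non-equi part: filters back down to one version per fact row
  and f.txn_date < d.end_date;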

Related

Exago BI - Can you join to the same table multiple times?

Let's say I have a date table.
Let's say I also have another table that I want to join to the date table twice: once on the sales date and again on the shipped date, with the goal of getting the month name, year, etc. from the date table for both.
Using Exago BI, is it possible to do such a join or would I have to create a view and just put the data in there manually?
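In plain SQL, the join I'm after would look something like this (all table and column names here are illustrative):
select s.sale_id,
       sold.month_name    as sale_month,
       sold.year_num      as sale_year,
       shipped.month_name as shipped_month,
       shipped.year_num   as shipped_year
from sales s
join date_dim sold    on sold.date_value = s.sales_date      -- first join: the sales date
join date_dim shipped on shipped.date_value = s.shipped_date -- second join: same date table, aliased again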

Are there any major performance issues if I use virtual columns in Oracle?

Are there any major performance issues if I use virtual columns in an Oracle table?
We have a scenario where the db has fields stored as strings. Since other production apps run off those fields we can't easily convert them.
I am tasked with generating reports from the same db. Since I need to be able to filter by dates (which are stored as strings), it was brought to my attention that we could create a virtual date field so that I can query against that.
Has anyone run into any roadblocks with this approach?
A virtual column is defined using an expression that is evaluated when you select from the table. There is no performance hit on inserts/updates on the table.
For example:
create table t1 (
  datestr varchar2(100),
  datedt date generated always as (to_date(datestr,'YYYYMMDD'))
);
Table created.
SQL> insert into t1 (datestr) values ('20160815');
1 row created.
SQL> insert into t1 (datestr) values ('xxx');
1 row created.
SQL> commit;
Commit complete.
Note that I was able to insert an invalid date value into datestr. Now we can try to select the data:
SQL> select * from t1 where datedt = date '2016-08-15';
ERROR:
ORA-01841: (full) year must be between -4713 and +9999, and not be 0
This could be a problem for you if you can't guarantee all the strings hold valid dates.
As for performance, when you run the above query what you are really running is:
select * from t1 where to_date(datestr,'YYYYMMDD') = date '2016-08-15';
So the query will not be able to use an index on the datestr column (probably), and you may want to add an index on the virtual column. Again, this won't work if any of the strings don't contain valid dates.
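Adding that index is a one-liner (the index name is arbitrary):
SQL> create index t1_datedt_ix on t1 (datedt);
This is effectively a function-based index on to_date(datestr,'YYYYMMDD'), and building it will fail with ORA-01841 if any existing row holds an invalid date string.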
Another consideration is the potential impact on existing code. Hopefully you won't have any code like insert into t1 values (...); i.e., not specifying the column list. If you do, you will get the error:
ORA-54013: INSERT operation disallowed on virtual columns
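The workaround is simply to always name the columns and skip the virtual one, as in the earlier example:
SQL> insert into t1 (datestr) values ('20160815');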

Is it possible to use SYSDATE as a column alias?

I want to run a report daily and store the report's date as one of the column headers. Is this possible?
Example output (Counting the activities of employees for that day):
SELECT EMPLOYEE_NAME AS EMPLOYEE, COUNT(ACTIVITY) AS "Activity_On_SYSDATE"
FROM EMPLOYEE_ACCESS
GROUP BY EMPLOYEE_NAME;

Employee    Activity_On_17042016
Jane        5
Martha      8
Sam         11
You are looking to do a reporting job with a data storing tool. The database (and SQL) is for storing and retrieving data, not for creating reports. There are special tools for creating reports.
In database design, it is very unhealthy to encode actual data in a table or column name. Neither a table name nor a column name should have, as part of the name (and of the way they are used), an employee ID, a date, or any other bit of actual data. Actual data should only be in fields, which in turn are in columns in different tables.
From what you describe, your base table should have columns for employee, activity and date. Then on any given day, if you want the count for the "current" day, you can query with
select employee, count(activity) ct
from table_name
where activity_date = trunc(sysdate)  -- SYSDATE carries a time-of-day part; TRUNC drops it (assumes activity_date holds dates without time)
group by employee
If you want, you can also include the activity_date column in the output; that will show for which date the report was run.
Note that I assumed the column name for the date is activity_date. And in the output I used ct as a column alias, not count. DATE and COUNT are reserved words, like SYSDATE, and you should NOT use them as table or column names. You could use them as aliases, as long as you don't need to refer to those aliases anywhere else in the SQL, but it is still a very bad idea. Imagine you ever need to refer to a column (by name or by alias) and the name or alias is SYSDATE. What would a where clause like this mean?
where sysdate = sysdate
Do you see the problem?
Also, I can't tell from your question - were you thinking of storing these reports back in the database? To what end? It is better to store just one query and run it whenever needed, making the activity_date for which you want the counts an input parameter, so you can run the query for any date at any time in the future. There is no need to store the actual daily reports in the database, as long as the base table is properly maintained.
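For example, with a bind variable (here called :report_date) the single stored query covers any date:
select activity_date, employee, count(activity) ct
from table_name
where activity_date = :report_date
group by activity_date, employee;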
Good luck!

Informatica issue - transformation

I want to know how to get the latest updated record via Informatica. Suppose I have 10 records in a temporary table: 3 records for Account1, 3 for Account2 and 4 for Account3. Now out of these 3 accounts, I need to fetch only the records which have the maximum (latest) date value and insert them into another temporary table. Which transformations could I use to get this, or what Informatica logic should I use? Please help.
If the date column comes from the input with a unique date, use an Aggregator transformation based on it and take the maximum date.
If no date column is present, you can assign the system timestamp, but you cannot take a maximum date from that; you would have to go for some other logic, such as rowid/rownum features.
If the source is a database, we can do it in the Source Qualifier (SQ) itself: aggregate by grouping on the PK field, selecting the PK and max(date), then join this output with the original source on PK and date.
For example:
select src_table.*
from src_table
join (select pk, max(date) as maxdate
      from src_table
      group by pk) aggr_table          -- group by pk was missing in the original; max() needs it per key
  on src_table.pk = aggr_table.pk
 and src_table.date = aggr_table.maxdate
The same can be implemented inside Informatica using an Aggregator and a Joiner. But since the Aggregator's source is the SQ, and its output joins back to the same SQ, one Sorter will be required between the Aggregator and the Joiner.
You can use an Aggregator transformation. Use a Sorter transformation first, sorting by account and date ascending. After that, apply the Aggregator transformation (grouping by account). You don't need to add any condition or aggregate function, as the Aggregator will return the last record of every group.

SQL Server - How to auto-delete "old" database records?

I have a database table storing shopping lists for users. I want to keep only 12 shopping lists per user: if user1 currently has 12 records in the table and creates a new shopping list, the oldest one should be deleted and the new shopping list stored.
The ShopList table consists of ShopListID (PK), UserID (FK) and a LastUpdatedDate, which is updated by a trigger whenever the user inserts or deletes any shop-list item belonging to the shopping list.
I have no idea how to do this at all. Should it be a trigger? Or a stored procedure? I really need help here.
Appreciate any feedback. Thanks!
You can do this via a trigger or a procedure. You can also query for the count in your service layer / business logic layer upon a save and remove the old records there as well. I'm for the business logic approach, as it's more testable and keeps business logic out of triggers and procedures, so my recommendation is a code-based approach.
I'd personally change the select query to only select the top 12, so that controls what the user can see.
I'd then use a database job that runs on a schedule to delete the ones that you don't want; a sketch follows.
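A sketch of that cleanup, assuming the ShopList table described in the question (it keeps the 12 most recently updated lists per user):
delete sl
from ShopList sl
join (
    select ShopListID,
           row_number() over (partition by UserID
                              order by LastUpdatedDate desc) as rn
    from ShopList
) ranked
  on ranked.ShopListID = sl.ShopListID
where ranked.rn > 12;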
I have come across this problem recently and it really depends on your "archiving" strategy.
What I have done is create a stored procedure that, for every user account, selects the records to be archived from a given element onwards (my requirement is very similar to yours, in the sense that I have to select the 31st element onwards in a user account). I can also give you some code here if you think it will come in handy.
I have created an extra table called XXXX_archive, which is a clone of the schema of your shopping_list table(s). This is to hold old, archived records in case a user asks to retrieve his list in the future (this is obviously optional but would come in handy).
The stored procedure finds these records, inserts them into the XXXX_archive table and then deletes them from XXXX. It runs on a nightly basis (or whenever you feel it's necessary) through the SQL Server Agent.
The effect is that the 13th element is not deleted the moment the user creates another shopping list, but I think that's fine, because you are in charge of your archiving strategy and can describe it in your TOS.
Just thought I should write up my experience here, because I sorted out this problem just days ago.
EDIT: My stored proc is as follows:
INSERT INTO shopping_lists_archive
SELECT *
FROM shopping_lists
WHERE id IN (
    SELECT id
    FROM (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY user_ID
                   -- newest first; the original ordered by user_ID, which is the partition key
                   -- and therefore gives arbitrary row numbers. Assumes a higher id means a
                   -- newer list; order by a LastUpdatedDate-style column if you have one.
                   ORDER BY id DESC) AS RowNumber,
               id, user_ID
        FROM shopping_lists c
        WHERE c.user_ID IN (SELECT user_ID FROM shopping_lists GROUP BY user_ID HAVING COUNT(1) > 12)
    ) t
    WHERE RowNumber > 12
);

DELETE FROM shopping_lists
WHERE id IN (
    SELECT id
    FROM (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY user_ID
                   ORDER BY id DESC) AS RowNumber,
               id, user_ID
        FROM shopping_lists c
        WHERE c.user_ID IN (SELECT user_ID FROM shopping_lists GROUP BY user_ID HAVING COUNT(1) > 12)
    ) t
    WHERE RowNumber > 12
);
There you go - it may be slightly different from what you need, because I'm archiving based on a join between two tables and had to amend my original query to your requirement.
