currently in our on-prem Hadoop environment we are using hive table with transaction properties. However as we have moving to AWS we don't have that feature yet. and so want to understand how to handle SCD Type 2 without updates.
for example.
for following record.
With Updates
In table with transaction properties enabled, when I get an update for a record, I go ahead and change the end_date to current date and create new record with effective_date as current date and end_date as 12/31/9999, as shows in above table. And so it's easier to find my active record (where end_date = "12/31/9999").
However, if I can't update the past record. I have two records with same end_date. as shows in table below.
My question are.
if I can update end_date of past record,
How do I get the historical duration of stay?
How do i get active record?
without updates
First of all, convert all dates to the 'yyyy-MM-dd' format, so they all will be sortable and analytic functions will work. Then you can use lead(effective_date, '2019-01-01') over(partition by id order by effective_date). For id=1 and effective_date = 2019-01-01 it should give you '2020-08-15' and you can assign this value as end_date for '2019-01-01' record. If there is no record with bigger effective_date, '9999-01-01' will be assigned. After this transformation Active record is that having '9999-01-01'.
Suppose dates are already converted to yyyy-MM-dd, this is how you can rewrite your table (after insert):
insert overwrite table your_table
select name, id, location, effective_date,
lead(effective_date,'2019-01-01') over(partition by id order by effective_date) as end_date
from your_table
Or without doing insert first, you can UNION ALL existing records with new records, in a subquery, then calculate lead.
Actually, SCD2 is not recommended for historical data rewriting because of non-equi join implementation in hive. It is implemented as cross-join + filter (or duplicating join on dim.id=fact.id (this will duplicate rows) + where fact.date<=dim.end_date and fact.date>=dim.effective_date - this should filter one record). This join is very expensive if the dimension and fact are big because of duplication before filtering.
How to compare table data structure.
1. Any table added or deleted.
2. Any column in the tables added or deleted.
So my job is to verify if any table or columns are added/deleted on 1st of every month.
My plan is to run a sql query and take a copy of entire list of tables and it's data type only (NO DATA) and save it in txt file or something and use it as base line, and next month run the same sql query and get the results and compare the file. is it possible? please help with the sql query which can do this job.
This query will give you a list of all tables and their columns for a given user (just replace ABCD in this query for the user you have to audit and providing you have access to all that users tables this will work).
SELECT table_name,
column_name
FROM all_tab_columns
WHERE owner = 'ABCD'
ORDER
BY table_name,
column_id;
This answers your question but I have to agree with a_horse_with_no_name that is not a good way implement change control, most notably because the changes have already happened.
This query is very basic and doesn't give you all the information you'd need to see if a column has changed (or any information about other objects types etc), but then you only asked about additions and deletions of tables and columns and you can compare the output of this script to previous outputs to find the answer to your allotted task.
Is there any major performance issues if I use virtual columns in an Oracle table?
We have a scenario where the db has fields stored as strings. Since other production apps run off those fields we can't easily convert them.
I am tasked with generating reports from the same db. Since I need to be able to filter by dates (which are stored as strings) it was brought to my attention that we could create a virtual date field so that I can query against that.
Has anyone ran into any roadblocks with this approach?
A virtual column is defined using an expression that is evaluated when you select from the table. There is no performance hit on inserts/updates on the table.
For example:
create table t1 (
datestr varchar2(100),
datedt date generated always as (to_date(datestr,'YYYYMMDD'))
);
Table created.
SQL> insert into t1 (datestr) values ('20160815');
1 row created.
SQL> insert into t1 (datestr) values ('xxx');
1 row created.
SQL> commit;
Commit complete.
Note that I was able to insert an invalid date value into datestr. Now we can try to select the data:
SQL> select * from t1 where datedt = date '2016-08-15';
ERROR:
ORA-01841: (full) year must be between -4713 and +9999, and not be 0
This could be a problem for you if you can't guarantee all the strings hold valid dates.
As for performance, when you run the above query what you are really running is:
select * from t1 where to_date(datestr,'YYYYMMDD') = date '2016-08-15';
So the query will not be able to use an index on the datestr column (probably), and you may want to add an index on the virtual column. Again, this won't work if any of the strings don't contain valid dates.
Another consideration is potential impact on existing code. Hopefully you won't have any code like insert into t1 values (...); i.e. not specifying the column list. If you do you will get the error:
ORA-54013: INSERT operation disallowed on virtual columns
I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table. What is the best way to solve this when using Cassandra ?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this ?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The SELECT * FROM posts ORDER BY creation; will results in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement I need to get all posts regardless of the username, in order of time is very hard to achieve in a distributed system, it supposes to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1. require a full table scan. Indeed as long as you don't fetch all records, the ordering can not be achieve. Unless you use Cassandra clustering column to order at insertion time. But in this case, it means that all posts are being stored in the same partition and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
Question 1:
Depending on your use case I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year,year-month, or year-month-day depending on your use case (or finer time intervals)
The basic idea is that you bucket changes for what suites your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
)
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can then insert the Partition Key (PK): creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update1';
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update2';
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
For a usage case using this table, let's say you wanted to see all of the changes today, your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another cql function, it will create the smallest possible timeUUID for the given time, this will guarantee that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly if your not on Cassandra 3.x+ or don't want to use materialized views you can use Atomic batches to ensure data consistency across your several de-normalized tables (that's what it was designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
The solution is to create another tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special column for grouping it, maybe by month and year, e.g. PRIMARY KEY((year, month), timestamp) this way the cassandra will have a better performance on read because it doesn't need to scan the whole cluster to get all data, it will also save the data transfer between nodes too.
Same as SELECT * FROM posts WHERE username='luke' ORDER BY content;, you must create another table for this query too. All column may be same as your first table but with the different Primary Key, because you cannot order by the column that is not the clustering column.
I have a table with a huge volume of data. I need partitioning to be done on a daily basis automatically. I need the name of the partition to be the date of sysdate. How can I go about doing this?
There is currently (11gR2) no way to specify a name for the auto-generated partitions in an interval-partitioned table. See Common Questions On Interval Partitioning [ID 1479115.1] (Oracle support account required):
What will be the names of the automatically created interval partitions?
[...] Currently it is not possible to specify a mask or template for partition names, but the system generated name can be renamed [...]
You're also restricted to a partition key column which must be of type DATE or NUMBER, and a few other things (see that note).
You can follow the example in the Creating Partitions documentation for the syntax:
create table foo (date_created date, ...)
partition by range(date_created)
interval(numtodsinterval(1, 'DAY'))
(partition one values less than (to_date('01012013', 'DDMMYYYY')));
With the above, a new partition will be created whenever you insert a row with a date value this year or later. New partitions will not be created for dates before 2013.
To work around the partition name issue (if necessary at all), you could rename the partitions based on HIGH_VALUE in USER_TAB_PARTITIONS, although that doesn't sound very nice.
Another option is to not rename them at all and use this syntax when you want to query a specific partition:
select *
from foo
partition for (<the day you're interested in>);
See for example: Oracle Interval Partitioning Tips.