Check for data transformation in Oracle ETL

I am new to Oracle and I would like to know how we validate parent-child relationships, compare ranges of values, and validate data types as part of the ETL testing process. (The two tables could be T1 and T2.) Please show me a sample query.
Example: T1 is the temporary loading table and T2 is the new table. We need to make sure that all the data between T1 and T2 is valid with respect to the range of values of the variables, the relationships, and the data types.
Thanks, Santosh

In order to validate data between two Oracle tables, the following scenarios should be considered:
1) Data comparison - compare the data between the two tables using MINUS queries, run in both directions so you catch rows missing from either side:
SELECT [column names] FROM tableA
MINUS
SELECT [column names] FROM tableB;
2) Business rules - verify the data complies with the business rules, for example that age lies within some range. You can write negative queries to test such scenarios, like:
SELECT * FROM tableB WHERE (age < x OR age > y);
3) Data truncation - make sure data in the target database is not truncated. Check that the length of each target column is not less than the length of the source column, or than the maximum length of the data on the source side.
4) Data correctness - verify that data is not inaccurately recorded; check default values, field boundaries, unique keys, primary keys, etc.
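A minimal sketch of the three checks the question asks about, for T1 (staging) and T2 (target), assuming hypothetical columns id, parent_id and age - adjust the names to your schema:
-- 1) Parent-child relationship: child rows whose parent was not loaded
SELECT c.id
FROM T2 c
LEFT JOIN T2 p ON p.id = c.parent_id
WHERE c.parent_id IS NOT NULL
AND p.id IS NULL;

-- 2) Range of values: rows whose age falls outside the allowed range
SELECT * FROM T2 WHERE age < 0 OR age > 120;

-- 3) Data types and lengths: compare T1 and T2 in the data dictionary
SELECT a.column_name, a.data_type AS t1_type, b.data_type AS t2_type,
       a.data_length AS t1_len, b.data_length AS t2_len
FROM user_tab_columns a
JOIN user_tab_columns b ON b.column_name = a.column_name
WHERE a.table_name = 'T1' AND b.table_name = 'T2'
AND (a.data_type <> b.data_type OR a.data_length > b.data_length);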

Related

Different default ordering between Oracle and PostgreSQL

I have a simple Oracle query which I need to rewrite to run on PostgreSQL with the same output:
SELECT X, Y FROM table_name ORDER BY Y;
In the case where I have only the below data in the table, here is the difference between PostgreSQL and Oracle in ordering the data.
Do you have any idea why this difference occurs?
There is no such thing as "default ordering" - neither in Oracle nor in Postgres (or in any other relational database). Tables in a relational database represent un-ordered sets.
You are sorting on a column that contains the same value for both (all) rows. This is essentially the same as not sorting at all, because you have not defined any sort criteria to break those ties. Without an additional sort column the database is free to return the rows with the same sort value in any order it likes.
If you want the rows sorted by column X you need to include that column in the ORDER BY:
SELECT X, Y
FROM table_name
ORDER BY X, Y;
or maybe you want ORDER BY Y, X; it's not clear from your question (or from the hard-to-read screenshots).

How can I merge two tables using ROWID in oracle?

I know that ROWID is distinct for each row in different tables. But I saw somewhere that two tables were being merged using ROWID, so I tried it as well, but I am getting empty output.
I have a person table. scrowid is the column which contains the rowid; it was added as:
ALTER TABLE ot.person
ADD scrowid VARCHAR2(200) PRIMARY KEY;
I populated this person table as:
INSERT INTO ot.person (id, name, age, scrowid)
SELECT id, name, age, a.ROWID FROM ot.per a;
After this I also created another table, ot.temp_person, by the same steps. Both tables have the same structure and data types. So I wanted to compare them using an inner join, which I tried as:
SELECT * FROM ot.person p INNER JOIN ot.temp_person tp ON p.scrowid = tp.scrowid;
I got an empty table as output.
Is there any possible way I can merge two tables using rowid? Or have I forgotten some step? If there is any way to join these two tables using rowid, please suggest it.
Define scrowid with datatype ROWID or UROWID; then it may work.
However, in general the ROWID may change at any time unless you lock the record, so it would be a poor key to join your tables.
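To see when a rowid join can match at all, here is a hypothetical demo (table names t_a and t_b are made up): both copies capture rowids from the same rows of ot.per, so the join finds matches; rowids captured from two different source tables would (almost) never coincide:
-- CTAS keeps the aliased ROWID pseudo-column as a real ROWID-typed column
CREATE TABLE t_a AS SELECT p.ROWID AS src_rowid, p.id FROM ot.per p;
CREATE TABLE t_b AS SELECT p.ROWID AS src_rowid, p.id FROM ot.per p;

-- Matches, because both src_rowid columns came from the same source rows
SELECT a.id
FROM t_a a
INNER JOIN t_b b ON a.src_rowid = b.src_rowid;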
I think perhaps you misunderstood the merging of two tables via rowid, unless what you actually saw was a UNION, cross join, or full outer join. Any attempt to match rowids, regardless of how you define them, is doomed to fail. This results from rowid being an internal definition. ROWID is not just a data type, it is an internal structure (that is an older version of the description, but Oracle doesn't link documentation versions). Those fields are basically:
- The data object number of the object
- The data block in the datafile in which the row resides
- The position of the row in the data block (first row is 0)
- The datafile in which the row resides (first file is 1). The file number is relative to the tablespace.
So while it's possible for different tables to have the same rowid, it would be extremely unlikely. Thus an inner join on them will essentially always return an empty result.
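If you're curious, you can decode those components for any row with the standard DBMS_ROWID package (table name as in the question):
SELECT ROWID,
       DBMS_ROWID.ROWID_OBJECT(ROWID)       AS object_no,
       DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID) AS file_no,
       DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID) AS block_no,
       DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)   AS row_no
FROM ot.person;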

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?
Example Scenario
Let's say I have a simple table containing posts that users have written:
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL; I would just like to know what the correct/durable/efficient way of doing this is.
The query SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system. It requires you to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take the top N latest posts
Point 1 requires a full table scan; as long as you don't fetch all records, the ordering cannot be achieved. The exception is using a Cassandra clustering column to order at insertion time, but in that case all posts are stored in the same partition, and this partition will grow forever ...
The query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
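For reference, a sketch of such a materialized view (Cassandra 3.x+; the view name posts_by_user_content is made up), which keeps a per-user, content-ordered copy of posts maintained automatically:
CREATE MATERIALIZED VIEW posts_by_user_content AS
    SELECT username, content, creation FROM posts
    WHERE username IS NOT NULL AND content IS NOT NULL AND creation IS NOT NULL
    PRIMARY KEY ((username), content, creation);

-- Already sorted by content within each user's partition
SELECT * FROM posts_by_user_content WHERE username = 'luke';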
Question 1:
Depending on your use case, I bet you could model this with time buckets, based on the range of times you're interested in.
You can do this by making the primary key a year, year-month, or year-month-day, depending on your use case (or finer time intervals).
The basic idea is that you bucket changes to suit your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed-out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
);
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can insert rows with the partition key (PK) components creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function to generate a timeuuid; you would probably want to generate this in the application instead, parse out the yyyy-mm-dd for the PK, and then insert the timeuuid into the clustering column.
As a usage example for this table, let's say you wanted to see all of the changes made today; your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 17:00-0600');
minTimeuuid() is another CQL function; it will create the smallest possible timeuuid for the given time, which guarantees that you get all of the changes from that time onward.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
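For instance, a span covering two days touches two partitions; one option is an IN on the last partition-key column (or simply one query per day, merged client-side):
SELECT * FROM posts_by_creation
WHERE creation_year = 2016 AND creation_month = 4 AND creation_day IN (1, 2);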
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns, e.g.:
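A sketch of that batch, assuming a hypothetical posts_by_content table for query (2) (one possible definition is sketched at the end of the next answer). The timeuuid literal stands in for a value generated once client-side, so all three rows share it; posts keeps the timestamp column from its original definition:
-- Logged batch: either all three inserts are applied or none are
BEGIN BATCH
    INSERT INTO posts (username, creation, content)
        VALUES ('luke', '2016-04-02 17:05-0600', 'hello world');
    INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content)
        VALUES (2016, 4, 2, a4a70900-24e1-11df-8924-001ff3591711, 'luke', 'hello world');
    INSERT INTO posts_by_content (username, content, creation)
        VALUES ('luke', 'hello world', a4a70900-24e1-11df-8924-001ff3591711);
APPLY BATCH;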
The solution is to create other tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need a special column for grouping, maybe by month and year, e.g. PRIMARY KEY ((year, month), timestamp). This way Cassandra will have better read performance, because it doesn't need to scan the whole cluster to get all the data, and it saves data transfer between nodes too.
The same goes for SELECT * FROM posts WHERE username='luke' ORDER BY content;: you must create another table for this query too. All columns may be the same as in your first table but with a different primary key, because you cannot order by a column that is not a clustering column. For example:
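A sketch of that second table (the name posts_by_content is made up); keeping creation in the clustering key stops two posts with identical content from overwriting each other:
CREATE TABLE posts_by_content (
    username varchar,
    content varchar,
    creation timeuuid,
    PRIMARY KEY ((username), content, creation)
);

-- Rows come back sorted by content within the user's partition
SELECT * FROM posts_by_content WHERE username = 'luke';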

fact table is being populated with too many records

There are 62000 records in my fact table, which is not correct, because I only have six records in my time dim, 240 records in my student dim and 140 records in my placement dim. Does it have something to do with my WHERE clause? Any help would be much appreciated.
INSERT INTO fact_placements (
report_id,
no_of_placements,
no_of_students,
fk1_time_id,
fk2_placement_id,
fk3_student_id )
SELECT
fact_seq.nextval,
no_of_placements,
no_of_students,
time_id,
placement_id,
student_id
FROM
time_dim,
placement_dim,
student_dim
WHERE
placement_dim.year = time_dim.year AND
student_dim.year = time_dim.year;
Unless you do a Cartesian join, i.e. without any WHERE clause, you will get fewer than 140 (placement) * 240 (student) * 6 (time) = 201600 fact records. Your current SQL uses the year column in the 3 tables to join, and this is filtering the records down to the 62000 you are getting.
Your question title says that even this is "too many". If that is the case, then you need to understand the granularity of your dimensions and the fact before joining them on any criteria. Are these all at the "year" level? If so, do you have 1 record per year in each of these tables and no duplicates based on year? (A quick check is sketched below.)
If not, you might need to re-think the fact table's granularity, or alternatively join unique records based on year in each dimension to get the actual (smaller) number of records you are expecting, which can also be done by summarizing these tables based on year.
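A quick way to run that grain check (table names as in your query); any rows returned mean the dimension has more than one record per year:
SELECT year, COUNT(*) FROM time_dim GROUP BY year HAVING COUNT(*) > 1;
SELECT year, COUNT(*) FROM placement_dim GROUP BY year HAVING COUNT(*) > 1;
SELECT year, COUNT(*) FROM student_dim GROUP BY year HAVING COUNT(*) > 1;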
Ideally the fact table contains the combinations of the dimension keys with additional column i.e. the factual metrics (in this case no_of_placements and no_of_students). But depending on the available data not all combinations will be present in the fact table.
Also, you might want to change the SQL syntax to use the INNER JOIN clause instead of the implicit joins using commas between table names in the FROM clause, as shown below:
FROM time_dim
INNER JOIN placement_dim
    ON placement_dim.year = time_dim.year
INNER JOIN student_dim
    ON student_dim.year = placement_dim.year
There's no relationship between placement and student; that's why you have so many records.
Your query is saying: give me all the students and all the placements where the year is the same.
I'm not sure that's what you want. What is really strange here is that you are loading a fact table from dimension tables.

comparing data in two tables taking time

I need to query table1 to find all orders and their created dates (the key is order number and date).
In table2 (the key is order number and date), I check whether the order exists for a date.
For this I am scanning table1 and, for each record, checking whether it exists in table2. Is there a better way to do this?
In this situation, in which your key is identical for both tables, it makes sense to have a single table in which you store the data for both Table 1 and Table 2. That way you can do a single scan of your data and know straight away whether the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan. For example, in the case where you will not be populating rows at all in Table 2, you could simply use a ColumnPrefixFilter.
If, however, you do need to keep this data in 2 separate tables, you could pre-split the tables with the same region boundaries for both tables. This will be helpful for the query you are aiming for - load all rows in Table 1 where the row exists in Table 2. Essentially this would be a map-side join: you could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper gets the corresponding rows from both tables. You would probably need to implement your own multiple-input format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map-side join).
