How to create Row Sequence Number in ORC tables - hadoop

I would like to add a row numbering or row sequence ID column whose value automatically increments for each row. The Hive UDF UDFRowSequence can be used, but it runs in a single reducer. Is there any other feature in the latest Hive 0.14 to increment a row sequence automatically in ORC tables?

You might like to look at the ROW_NUMBER() function in a window covering the whole set. It relies on the data being sorted, but should therefore allow parallel partition processing.
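For example, a minimal HiveQL sketch (the table and column names are hypothetical, and the ordering column is an assumption):

-- CTAS into an ORC table, numbering the rows over the whole set
CREATE TABLE orders_seq STORED AS ORC AS
SELECT
  ROW_NUMBER() OVER (ORDER BY order_id) AS row_seq,
  order_id,
  order_date,
  amount
FROM orders;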

Related

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update some tables in my application from warehouse tables which are updated weekly or biweekly, so I should update my tables based on those. These tables are referenced by foreign keys from other tables, so I cannot just truncate a table and reinsert all of its data every time. Instead I have to take the delta and update accordingly, based on a few primary key columns which don't change. I need some input on how to implement this approach.
My approach:
Check the last updated time of those tables/views.
If it is more recent, compare each row in my table and the warehouse table based on the primary key.
Update each column if it is different.
Do nothing if there is no change in the columns.
Insert if there is a new record.
My Question:
How do I implement this? Is writing PL/SQL code a good and efficient way to do it, given that the expected number of records is around 800K?
Please provide any sample code or links.
I would go for PL/SQL with the BULK COLLECT / FORALL method. You can use MINUS in your cursor in order to reduce the data size and calculate the difference.
You can check this site for more information about bulk collect, forall and engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
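A minimal sketch of that idea, assuming a hypothetical target table my_table and source table warehouse_table that share a key column order_id and a single payload column order_date (real tables will have more columns):

DECLARE
  -- collections of scalars for the key and payload columns
  TYPE t_ids   IS TABLE OF my_table.order_id%TYPE;
  TYPE t_dates IS TABLE OF my_table.order_date%TYPE;
  l_ids   t_ids;
  l_dates t_dates;
BEGIN
  -- MINUS keeps only the rows that are new or different in the warehouse table
  SELECT order_id, order_date
    BULK COLLECT INTO l_ids, l_dates
    FROM (SELECT order_id, order_date FROM warehouse_table
          MINUS
          SELECT order_id, order_date FROM my_table);

  -- one bulk-bound MERGE handles both the updates and the inserts
  FORALL i IN 1 .. l_ids.COUNT
    MERGE INTO my_table t
    USING (SELECT l_ids(i) AS order_id, l_dates(i) AS order_date FROM dual) s
    ON (t.order_id = s.order_id)
    WHEN MATCHED THEN UPDATE SET t.order_date = s.order_date
    WHEN NOT MATCHED THEN INSERT (order_id, order_date)
                          VALUES (s.order_id, s.order_date);

  COMMIT;
END;
/

For 800K rows you may prefer to fetch the delta in batches with a LIMIT clause on the BULK COLLECT rather than loading it all into memory at once.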
There are many parts to your question above and I will answer as best I can:
While it is possible to disable the referencing foreign keys, truncate the table, repopulate it with the updated data and then re-enable the foreign keys, given your requirements described above I don't believe truncating the table each time is optimal.
Yes, in principle PL/SQL is a good way to achieve what you want, as this is too complex to deal with in native SQL and PL/SQL is an efficient alternative.
Conceptually, the approach I would take is as follows:
Initial set up (see the SQL sketch at the end of this answer):
create a sequence called activity_seq
Add an "activity_id" column of type number to your source tables with a unique constraint
Add a trigger to the source table/s setting activity_id = activity_seq.nextval for each insert / update of a table row
create some kind of master table to hold the "last processed activity id" value
Then bi/weekly:
retrieve the value of "last processed activity id" from the master table
select all rows in the source table/s having an activity_id value > the "last processed activity id" value
iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or, if no match is found, insert a new row into the target (I assume there is no delete as you do not mention it)
on completion, update the master table's "last processed activity id" to the greatest value of activity_id for the source rows processed in step 3 above.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
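A rough SQL sketch of the set-up described above, using illustrative object names (the single source table here is called warehouse_table):

-- 1. the sequence
CREATE SEQUENCE activity_seq;

-- 2. the activity_id column with a unique constraint
ALTER TABLE warehouse_table ADD (activity_id NUMBER);
ALTER TABLE warehouse_table ADD CONSTRAINT warehouse_activity_uq UNIQUE (activity_id);

-- 3. stamp every inserted/updated row
--    (direct assignment from a sequence works in 11g+; older versions need SELECT ... FROM dual)
CREATE OR REPLACE TRIGGER warehouse_activity_trg
  BEFORE INSERT OR UPDATE ON warehouse_table
  FOR EACH ROW
BEGIN
  :NEW.activity_id := activity_seq.NEXTVAL;
END;
/

-- 4. master table holding the "last processed activity id" high-water mark
CREATE TABLE etl_control (last_processed_activity_id NUMBER);
INSERT INTO etl_control VALUES (0);
COMMIT;

-- then, bi/weekly, the delta is simply:
-- SELECT * FROM warehouse_table w
--  WHERE w.activity_id > (SELECT last_processed_activity_id FROM etl_control);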
I hope this proves helpful

How to add new partitions automatically in a partitioned table based on inserted data in Greenplum?

I have a partitioned table in Greenplum (modeled on PostgreSQL), which has been partitioned with a specific range of values.
Now I have to insert data again into the same table. New values for partitions might overlap with existing ones. I have created an ALTER command with new start and end dates, but if there are overlaps the command fails. So I need to create a partition for each date in order to avoid the whole command failing.
I am just wondering if there is a way in Greenplum to create partitions automatically based on the inserted data, just like Hive does.
Thanks for your help.
Greenplum does not (currently) create additional partitions for data which does not fit into an existing partition.
If you have a default partition on the table it will receive all the records which do not fit into one of the specified partitions. You can then use ALTER TABLE ... SPLIT DEFAULT PARTITION (see the documentation if required) to create the new partitions for any new dates at the end of the load batch.
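For example, a sketch assuming a table sales that is range-partitioned by date and already has a default partition (table name, partition names and dates are illustrative):

-- move the rows for February 2015 out of the default partition into their own partition
ALTER TABLE sales SPLIT DEFAULT PARTITION
  START ('2015-02-01') INCLUSIVE
  END ('2015-03-01') EXCLUSIVE
  INTO (PARTITION feb2015, DEFAULT PARTITION);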

Tune a query with ROWID and ORA_ROWSCN

I have the following query, which is taking a lot of time as the table is very big; the query also fetches the pseudocolumns ROWID and ORA_ROWSCN.
select ROWID, ORA_ROWSCN, t.C1, t.c2, t.c5, t.c7, t.c9 from tab t
I tried using the ALL_ROWS hint and gathered the stats as well, but it did not help much. Please suggest. Thanks a lot in advance.
Rowid and ora_rowscn both reside inside the data block.
Rowid is composed of:
The data object number of the object
The data block in the datafile in which the row resides
The position of the row in the data block (first row is 0)
The datafile in which the row resides (first file is 1). The file number is relative to the tablespace.
The ora_rowscn gives you the last change number for the block in which the row resides (not the row itself, be aware of that).
Neither of these contributes significantly to the total time taken to retrieve all the rows from a big table (unless you were using the scn_to_timestamp function).
The problem here is that you have no WHERE clause, and it simply takes a lot of time to retrieve all the rows from a big table. If you truly need all the rows, isn't there any column you could use to chop the query into many smaller queries, so you could start getting the results faster and maybe even parallelize the whole process (a date column, or an ID column you could apply MOD to in the WHERE clause, something like that)?
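For example, a sketch of chopping the query on an assumed numeric ID column (the column name and the number of chunks are assumptions):

-- run several copies in parallel, one per remainder value 0..7
select ROWID, ORA_ROWSCN, t.C1, t.c2, t.c5, t.c7, t.c9
  from tab t
 where mod(t.id, 8) = 0;

-- or chop on a date column instead, one range per query:
-- where t.created_date >= DATE '2015-01-01' and t.created_date < DATE '2015-02-01'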

comparing data in two tables taking time

I need to query table 1 to find all orders and their created date (the key is order number and date).
In table 2 (the key is order number and date), I check whether the order exists for a date.
For this I am scanning table 1 and, for each record, checking if it exists in table 2. Is there any better way to do this?
In this situation in which your key is identical for both tables, it makes sense to have a single table in which you store both data for Table 1 and Table 2. In that way you can do a single scan on your data and know straight away if the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan. For example, in the case where you will not be populating rows at all in Table 2, you would simply use a ColumnPrefixFilter.
If, however, you do need to keep this data separately in 2 tables, you could pre-split the tables with the same region boundaries for both tables - this will be helpful when you do the query that you are aiming for - load all rows in Table 1 when a row exists in Table 2. Essentially this would be a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper will have corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map-side join).

create index before adding columns vs. create index after adding columns - does it matter?

In Oracle 10g, does it matter what order CREATE INDEX and ALTER TABLE come in?
Say I have a query Q with a WHERE clause on column C in table T. Now I perform one of the following scenarios:
I create index I(C) and then add columns X,Y,Z.
Add columns X,Y,Z then create index I(C).
Q is 'select * from T where C = whatever'
Between 1 and 2 will there be a significant difference in performance of Q on table T when T contains a very large number of rows?
I personally make it a practice to do #2 but others seem to have a different opinion.
Thanks.
It makes no difference if you add columns to a table before or after creating an index. The optimizer should pick the same plan for the query and the execution time should be unchanged.
Depending on the physical storage parameters of the table, it is possible that adding the additional columns and populating them with data may force quite a bit of row migration to take place. That row migration will generate changes to the indexes on the table. If the index exists when you are populating the three new columns with data, it is possible that populating the data in X, Y, and Z will take a bit longer because of the additional index maintenance.
If you add columns without populating them, then it is pretty quick as it is just a metadata change. Adding an index does require the table to be read (or potentially another index) so that can be very time consuming and of much greater impact than the simple metadata change of recording the new index details.
If the new columns are going to be populated as part of the ALTER TABLE, it is a different matter.
The database may undergo an unplanned shutdown during the course of adding that data to every row of the table.
The server memory may not have room to record every row changed in that table.
Those row changes may therefore be written to the datafiles before commit, as dirty blocks.
The next read of those blocks, after the ALTER TABLE has successfully completed, will do a delayed block cleanout (i.e. record the fact that the change has been committed).
If you add the columns (with data) first, then the create index will (probably) read the table and do the added work of the delayed block cleanout.
If you create the index first and then add the columns, the CREATE INDEX may be faster, but the delayed block cleanout won't happen and that housekeeping will be picked up by the application later (potentially by the select * from T where C = whatever).
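For illustration, the two orderings from the question might look like this (a DEFAULT value is used here as one way the new columns could be populated as part of the ALTER TABLE; the exact column and index definitions are assumptions):

-- Scenario 1: create the index first, then add the populated columns
create index I on T (C);
alter table T add (X number default 0, Y number default 0, Z number default 0);

-- Scenario 2: add the populated columns first, then create the index
alter table T add (X number default 0, Y number default 0, Z number default 0);
create index I on T (C);

-- in 10g the DEFAULT causes every existing row to be updated as part of the ALTER TABLE,
-- which is what makes the ordering of the two statements interesting, as discussed above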
