Change schema in an Impala/Hive table with a very large amount of data? - hadoop

We have a Hive table stored on HDFS with 800+ columns and >65 billion rows (and growing) and need to:
Remove a column with a complex type (small array)
Add a column with a complex type (small array)
Possibly add a handful of other columns (simple type, e.g. string or int)
Modify the contents of 3 columns for every row in the database (effectively read it in, make a simple change, write it back out to the same column and row that it came from). I realise this is probably a separate operation to the other three requirements above.
We could set up a new empty table with the new schema and copy the data over (using CREATE TABLE xxxxx FROM SELECT ... or INSERT INTO xxxx SELECT ...) but tests suggest it would take 1 - 3 weeks running non stop. And it's possible we may need to make further minor similar modifications in future.
Is there an efficient, sensible alternative to copying the whole table? Would ALTER TABLE work (at least for the structural changes, items 1 - 3 above)? What are the pros and cons of either option(s)?
Table is going to be queried using Impala, in a Zeppelin-based interface.
Thanks for any advice.

Related

How to create an index so that the following statement has an index

I have table A(id,name,code)
I have sql statement:
Select * from A where upper(code || name) like upper('%<search text>%');
How to create an index so that the following statement has an index?
Question for two option: table partitioned, and table not partitioned
Thanks & BR
Do you have a performance issue or is this just a hypothetical question?
An index is unlikely to help with this example: a full table scan will probably be the quickest solution. Why? Your table has 3 columns. The best index would be one that avoided looking in the table at all e.g.
create index ai on a (code, name, id);
But that needs to contain all the same data as the table plus a ROWID for each table row - so it is going to be bigger than the table and take longer to scan. You could try putting the columns in the index with the least selective first and using compression:
create index ai on a (code, name, id) compress;
Now the index may be smaller than the table - it depends on how selective the code and name columns are. If it is small enough, the optimizer might decide to use it instead of the table. It still contains all the IDs and ROWIDs so the reduction in size probably won't be dramatic. In the test case I set up the compressed index is about half the size of the table, yet Explain Plan shows the query has a higher cost if I use a hint to force it to use the index - maybe due to overheads of compression, I don't know.
You could look into Oracle Text and the CONTAINS expression - but then you would be writing a different query, not using LIKE.

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update the some tables in my application from some other warehouse tables which would be updating weekly or biweekly. I should update my tables based on those. And these are having foreign keys in another tables. So I cannot just truncate the table and reinsert the whole data every time. So I have to take the delta and update accordingly based on few primary key columns which doesn't change. Need some inputs on how to implement this approach.
My approach:
Check the last updated time of those tables, views.
If it is most recent then compare each row based on the primary key in my table and warehouse table.
update each column if it is different.
Do nothing if there is no change in columns.
insert if there is a new record.
My Question:
How do I implement this? Writing a PL/SQL code is it a good and efficient way? as the expected number of records are around 800K.
Please provide any sample code or links.
I would go for Pl/Sql and bulk collect forall method. You can use minus in your cursor in order to reduce data size and calculating difference.
You can check this site for more information about bulk collect, forall and engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
There are many parts to your question above and I will answer as best I can:
While it is possible to disable referencing foreign keys, truncate the table, repopulate the table with the updated data then reenable the foreign keys, given your requirements described above I don't believe truncating the table each time to be optimal
Yes, in principle PL/SQL is a good way to achieve what you are wanting to
achieve as this is too complex to deal with in native SQL and PL/SQL is an efficient alternative
Conceptually, the approach I would take is something like as follows:
Initial set up:
create a sequence called activity_seq
Add an "activity_id" column of type number to your source tables with a unique constraint
Add a trigger to the source table/s setting activity_id = activity_seq.nextval for each insert / update of a table row
create some kind of master table to hold the "last processed activity id" value
Then bi/weekly:
retrieve the value of "last processed activity id" from the master
table
select all rows in the source table/s having activity_id value > "last processed activity id" value
iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or if
no match is found then insert a new row into the target (I assume
there is no delete as you do not mention it)
on completion, update the master table "last processed activity id" to the greatest value of activity_id for the source rows
processed in step 3 above.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
I hope this proves helpful

comparing data in two tables taking time

I need to query table1 find all orders and created date ( key is order number an date)).
In table 2 ( key is order number an date) Check if the order exists for a a date.
For this i am scanning table 1 and for each record checking if it exists in table 2. Any better way to do this
In this situation in which your key is identical for both tables, it makes sense to have a single table in which you store both data for Table 1 and Table 2. In that way you can do a single scan on your data and know straight away if the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan. For example, in the case where you will not be populating rows at all in Table 2, you would simply use a ColumnPrefixFilter
If, however, you do need to keep this data separately in 2 tables, you could pre-split the tables with the same region boundaries for both tables - this will be helpful when you do the query that you are aiming for - load all rows in Table 1 when row exists in Table 2. Essentially this would be a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper will have corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map side join)

create index before adding columns vs. create index after adding columns - does it matter?

In Oracle 10g, does it matter what order create index and alter table comes in?
Say i have a query Q with a where clause on column C in table T. Now i perform one of the following scenarios:
I create index I(C) and then add columns X,Y,Z.
Add columns X,Y,Z then create index I(C).
Q is 'select * from T where C = whatever'
Between 1 and 2 will there be a significant difference in performance of Q on table T when T contains a very large number of rows?
I personally make it a practice to do #2 but others seem to have a different opinion.
thanks
It makes no difference if you add columns to a table before or after creating an index. The optimizer should pick the same plan for the query and the execution time should be unchanged.
Depending on the physical storage parameters of the table, it is possible that adding the additional columns and populating them with data may force quite a bit of row migration to take place. That row migration will generate changes to the indexes on the table. If the index exists when you are populating the three new columns with data, it is possible that populating the data in X, Y, and Z will take a bit longer because of the additional index maintenance.
If you add columns without populating them, then it is pretty quick as it is just a metadata change. Adding an index does require the table to be read (or potentially another index) so that can be very time consuming and of much greater impact than the simple metadata change of recording the new index details.
If the new columns are going to be populated as part of the ALTER TABLE, it is a different matter.
The database may undergo an unplanned shutdown during the course of adding that data to every row of the table data
The server memory may not have room to record every row changed in that table
Therefore those row changes may be written to datafiles before commit, and are therefore written as dirty blocks
The next read of those blocks, after the ALTER table has successfully completed will do a delayed block cleanout (ie record the fact that the change has been committed)
If you add the columns (with data) first, then the create index will (probably) read the table and do the added work of the delayed block cleanout.
If you create the index first then add the columns, the create index may be faster but the delayed block cleanout won't happen and that housekeeping will be picked up by the application later (potentially by the select * from T where C = whatever)

oracle - moving data from to identical database

I have two databases with identical table layouts. There are a dozen or so tables of interest. They are a number of FK between them.
I have been asked to write a stored procedure to copy data from database A to database B based on the PK of the parent table at the top of the hierarchy. I may receive just one value, or a list of values. I'm supposed to select all records from database A that match the value(s) and insert/update them into database B. This includes all the records in the child tables too.
My questions is whats the best(most efficent/ best practice) way to do this?
Should I write a dozen select from... insert into... statements?
Should I join the tables together an try to insert into all the tables at the same time?
Thanks!
Additional info:
The record should be inserted if it is not already there. (based on the PK of the respective table). Otherwise it should be updated.
Obviously I need to traverse down to all child tables, so There would only be one record to copy in the parent table, but the child table might have 10, and the child's child table might have 500. I would of course need to update the record if it already existed, insert if it does not for the child tables too...
UPDATE:
I think it would make the solution simpler if I just deleted all records related to the top level key, and then insert all the new records rather than trying to do updates.
So I guess the questions is it best to just do a dozen:
delete from ... where ... in ...
select from ... where ... in ...
insert into...
or is it better to do some kinda of fancy joins to do all the inserts in one sql statement?
I would do this by disabling all the foreign key constraints, then doing a set of MERGE statements to deal with the updates and inserts, then enable all the constraints.
Think about logging. How much redo do you want to generate?
You might find that it's quicker and better to truncate all the target tables and then do inserts of everything with nolog. Could be simpler than the merges.
One major main alternative would be to drop all the target tables and use export and import. Might be a lot faster.
A second alternative would be to use materialized views, particularly if you don't need to do updates on the target tables. That way, Oracle does all the heavy lifting for you. You can force integrity by choosing refresh groups carefully.
There are several ways to deal with this business problem. A PL/SQL program may not be the best.

Resources