Hive table or view? Which is the right approach?

I am new to HDFS/Hive and need some advice. My background is RDBMS data modelling.
I have a requirement for a daily report. The report requires fetching data from two staging tables (Hive).
What if I create a table in Hive, write a view to fetch records from the staging tables and populate the Hive table, and then create a Hive view pointing to that table with a WHERE clause selecting one day of data?
1. Hive staging tables ---> 2. View to populate the Hive table ---> 3. Hive table ---> 4. View to fetch data from the Hive table created in 3.
Or what if I create a view on top of the two staging Hive tables (joining the two tables with a WHERE clause to fetch one day of data)?
1. Hive staging tables ---> 2. View to fetch data from the Hive staging tables
I want to know Hive best practices and solution strategies.

View or no view, you need an ETL process to load the tables either way. The ETL process can join, aggregate, etc., so that the final joined and aggregated data is available in the form of a star/snowflake schema or a report table. Why would you need views here? To reuse common queries, to reduce the complexity of long queries, to provide interfaces to the data, to create logical entities, and so on. You do not necessarily need a view simply to join tables and load data into another table. It all depends on your requirements. If reports should query data fast, then the data should be precalculated by the ETL process. A view is just a wrapper over a query; it is recalculated each time you query it.
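As a hedged sketch (all table, column and view names here are hypothetical), the ETL step could precalculate the joined data into a report table partitioned by day, while the view stays a thin wrapper:

-- ETL: precalculate the join once per day into a partitioned report table (hypothetical names)
INSERT OVERWRITE TABLE daily_report PARTITION (report_dt = '2024-01-15')
SELECT o.customer_id,
       sum(p.amount) AS total_amount
FROM staging_orders o
JOIN staging_payments p ON o.order_id = p.order_id
WHERE o.order_dt = '2024-01-15'
GROUP BY o.customer_id;

-- a view is just a wrapper over a query; it is recalculated on every read
CREATE VIEW IF NOT EXISTS v_report_one_day AS
SELECT * FROM daily_report WHERE report_dt = '2024-01-15';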

I think it's best to have zero views and one single table, and to make the date field your partition key (but you can't partition on a date, so you have to store it as a string). This makes it easier for the end user to work with only one table, and means fewer tables overall.
This gives your users the ability to query only the latest date they need, or to leverage the full table.
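A sketch of that layout (names are hypothetical), with the date kept as a string partition column:

-- hypothetical report table, date stored as a string partition key
CREATE TABLE daily_report (
  customer_id  BIGINT,
  total_amount DOUBLE
)
PARTITIONED BY (report_dt STRING)
STORED AS ORC;

-- users can hit only the latest date they need...
SELECT * FROM daily_report WHERE report_dt = '2024-01-15';

-- ...or leverage the full table
SELECT report_dt, sum(total_amount) FROM daily_report GROUP BY report_dt;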

Related

How to use external tables in Hive?

Can anyone please explain why and where we use external tables in Hive?
Please explain a scenario to make it easy to understand.
We use an external table when the underlying dataset pointed to by the Hive table is shared by multiple consumers, e.g. MapReduce jobs, Pig, etc., and we use a managed table when the dataset pointed to by the Hive table is used only by Hive.
With a managed table, Hive has full control over the dataset: if you drop a managed table, the dataset is also deleted from the Hive warehouse (/user/hive/warehouse) in HDFS, whereas if you drop an external table, the dataset is not deleted from HDFS.
For example, suppose you have a 50 GB dataset. Creating multiple copies of it for different purposes simply takes more space, so the better option is an external table: when you drop the table the dataset is not deleted, and it can still be used by other applications such as Pig.
As a rule of thumb: use an external table if you plan to work with the data not only from Hive but from other frameworks as well. Otherwise make it internal (managed).
The only difference between External and Managed table in Hive is Drop table or Drop partition behavior. For Managed it will drop data as well, for External table the data will remain untouched in the table/partition location.
Use external tables in most cases. An external table allows you to change the table definition easily, and you can also create several tables on top of the same location.
Use Managed table if the table is temporary/intermediate and data should be deleted to free space.
Managed table can be converted to external and vice-versa using
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
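and the reverse conversion, from external back to managed, flips the same property:

alter table table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');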

De-duplication from two hive tables

We are stuck with a problem wherein we are trying to do a near real-time sync between an RDBMS (source) and Hive (target). Basically, the source pushes the changes (inserts, updates and deletes) into HDFS as Avro files. These are loaded into external tables (with the Avro schema) in Hive. There is also a base table in ORC, which has all the records that came in before the source pushed the new set of records.
Once the data is received, we have to de-duplicate (since there could be updates to existing rows) and remove all deleted records (since there could be deletes from the source).
We are now de-duplicating using rank() over the partitioned keys on the union of the external table and the base table. The result is then pushed into a new table and the names are swapped. This is taking a lot of time.
We tried using merges and ACID transactions, but rank over partition and then filtering out the rows has given us the best time so far.
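For reference, the de-dup step currently looks roughly like this (a sketch with hypothetical column names, using row_number() so exactly one row per key survives):

-- hypothetical columns: id (key), payload, op ('I'/'U'/'D'), load_ts
-- keep the latest version of each key and drop keys whose latest change is a delete
INSERT OVERWRITE TABLE base_new
SELECT id, payload, op, load_ts
FROM (
  SELECT u.*,
         row_number() OVER (PARTITION BY id ORDER BY load_ts DESC) AS rn
  FROM (
    SELECT id, payload, op, load_ts FROM incoming_ext   -- Avro external table
    UNION ALL
    SELECT id, payload, op, load_ts FROM base           -- ORC base table
  ) u
) ranked
WHERE rn = 1
  AND op <> 'D';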
Is there a better way of doing this? Any suggestions for improving the process altogether? We have quite a few tables, so we do not have any partitions or buckets at this moment.
You can try storing all the transactional data in an HBase table.
Storing data in an HBase table using the primary key of the RDBMS table as the row key:
Once you pull all the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output of those processors is in Avro format.
You can use the ConvertAvroToJSON processor and then the SplitJson processor to split each record out of the array of JSON records.
Store all the records in the HBase table with the row key set to the primary key of the RDBMS table.
When we get an incremental load based on a Last Modified Date field, we get the updated records and the newly added records from the RDBMS table.
If we get an update for an existing row key, HBase overwrites the existing data for that record; newly added records are added as new rows in the table.
Then, using Hive-HBase integration, you can expose the HBase table data through Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
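A minimal sketch of such a Hive-over-HBase mapping (table, column family and column names are hypothetical):

-- hypothetical HBase table "customer" with column family "cf"
CREATE EXTERNAL TABLE customer_hbase (
  row_key    STRING,
  name       STRING,
  updated_at STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:updated_at")
TBLPROPERTIES ("hbase.table.name" = "customer");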
With this method we have an HBase table that takes care of all the upsert operations, but we cannot expect the same performance from a Hive-on-HBase table as from a native Hive table; native Hive tables perform faster, because HBase tables are not meant for SQL-style queries and are most efficient when you access data by row key.
If we are going to have millions of records, then we need to do some tuning of the Hive queries:
Tuning Hive Queries That Uses Underlying HBase Table

Perform bulk data insert from multiple tables in oracle

Please advise on the best way to perform a bulk data load from multiple tables into a single table.
I need to pivot the data from two tables, compare the result with a third table, and load the data into a fourth table.
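As a hedged sketch (all table and column names here are hypothetical), one way is a single INSERT ... SELECT that pivots the two source tables, checks the result against the third table, and loads the fourth:

-- hypothetical tables: src_tbl1/src_tbl2 (sources), ref_tbl (comparison), target_tbl (target)
INSERT INTO target_tbl (item_id, q1_amt, q2_amt)
SELECT p.item_id, p.q1_amt, p.q2_amt
FROM (
  SELECT item_id, quarter, amount FROM src_tbl1
  UNION ALL
  SELECT item_id, quarter, amount FROM src_tbl2
)
PIVOT (SUM(amount) FOR quarter IN ('Q1' AS q1_amt, 'Q2' AS q2_amt)) p
JOIN ref_tbl r
  ON r.item_id = p.item_id          -- compare against the third table
WHERE r.status = 'ACTIVE';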

Why do we need to move an external table to a managed Hive table?

I am new to Hadoop and learning Hive.
In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph,
I don't understand the paragraph below regarding external tables in Hive.
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody briefly explain what the above phrase says?
Usually the data in the initial dataset is not structured in the optimal way for queries.
You may want to modify the data (modifying columns, adding columns, aggregating, etc.) and store it in a specific way (partitioned / bucketed / sorted, etc.) so that queries benefit from these optimizations.
The key difference between an external and a managed table in Hive is that the data in an external table is not managed by Hive.
When you create an external table you define an HDFS directory for that table; Hive simply "looks" into it and can read data from it, but it cannot delete or change data in that folder. When you drop an external table, Hive only deletes the metadata from its metastore and the data in HDFS remains unchanged.
A managed table is basically a directory in HDFS that is created and managed by Hive. Moreover, all operations that remove or change partitions, raw data, or the table itself MUST be done through Hive, otherwise the metadata in the Hive metastore may become incorrect (e.g. you manually delete a partition from HDFS but the Hive metastore still says the partition exists).
In Hadoop: The Definitive Guide, I think the author meant that it is a common practice to write an MR job that produces some raw data and keeps it in some folder. You then create a Hive external table that looks into that folder, and you can safely run queries without the risk of losing the data by dropping the table, etc.
In other words, you can run an MR job that produces some generic data and then use a Hive external table as the source of data for inserts into managed tables. This helps you avoid writing lots of similar, boring MR jobs and delegates that task to Hive queries: you write a query that takes data from the external table, aggregates/processes it however you want, and puts the result into managed tables.
Another use of external tables is as a source for data coming from remote servers, e.g. in CSV format.
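A minimal sketch of that pattern (paths, formats and column names are hypothetical):

-- external table over raw data produced by another process (hypothetical path)
CREATE EXTERNAL TABLE raw_events (
  event_id BIGINT,
  event_ts STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/events';

-- managed table optimized for queries
CREATE TABLE events (
  event_id BIGINT,
  event_ts STRING,
  payload  STRING
)
PARTITIONED BY (event_dt STRING)
STORED AS ORC;

-- the Hive query replaces a hand-written MR job for the transform/load step
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE events PARTITION (event_dt)
SELECT event_id, event_ts, payload, substr(event_ts, 1, 10) AS event_dt
FROM raw_events;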
There is no reason to move a table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in the future, so it is better to check the current documentation. Currently these features are:
- ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
- DROP deletes data for managed tables while it only deletes metadata for external ones
- ACID/Transactional only works for managed tables
- Query Results Caching only works for managed tables
- Only the RELY constraint is allowed on external tables
- Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
The data structure has nothing to do with the external/managed table type. If you want to change the structure, you do not necessarily need to change the managed/external table type.
It is also mentioned in the book:
When your table is an external table, you can use other technologies like Pig, Cascading or MapReduce to process it.
You can also use multiple schemas for the same dataset.
And you can create the data lazily if it is an external table.
When you decide that the dataset should be used only by Hive, make it a Hive managed table.
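For example (hypothetical names and path), two external tables can expose the same HDFS directory with different schemas:

-- hypothetical shared dataset under /data/shared/clicks
CREATE EXTERNAL TABLE clicks_full (
  user_id  BIGINT,
  click_ts STRING,
  url      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/shared/clicks';

-- a narrower schema over the very same files
CREATE EXTERNAL TABLE clicks_slim (
  user_id  BIGINT,
  click_ts STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/shared/clicks';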

Whats the best way to perform selective record replication at an Oracle database

Suppose the following scenario:
I have a master database that contains lots of data. In this database I have a key table that I am going to call DataOwners for this example. The DataOwners table has 4 records, and each record of every other table in the database "belongs" directly or indirectly to a record of DataOwners; by "belongs" I mean it is linked to it with foreign keys.
I also have 2 other slave databases with exactly the same structure as my master database, which are only updated through replication from the master database. SlaveDatabase1 should only have records from DataOwner 2 and SlaveDatabase2 should only have records from DataOwners 1 and 3, whereas MasterDatabase has records of DataOwners 1, 2, 3 and 4.
Is there any tool for Oracle that allows me to do this kind of selective record replication?
If not, is there any way to improve my current replication method, which is:
1. add to each table a trigger that inserts the record changes into a group of replication tables
2. execute the commands from the replication tables on the selected slaves
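Roughly, the trigger part looks like this today (a sketch with a hypothetical repl_log table and a hypothetical id primary key on table A):

-- hypothetical replication log table
CREATE TABLE repl_log (
  table_name  VARCHAR2(30),
  operation   VARCHAR2(6),
  pk_value    NUMBER,
  change_time TIMESTAMP DEFAULT SYSTIMESTAMP
);

CREATE OR REPLACE TRIGGER trg_a_repl
AFTER INSERT OR UPDATE OR DELETE ON a
FOR EACH ROW
BEGIN
  -- record which row changed and how; the slaves replay this later
  INSERT INTO repl_log (table_name, operation, pk_value)
  VALUES ('A',
          CASE WHEN INSERTING THEN 'INSERT'
               WHEN UPDATING  THEN 'UPDATE'
               ELSE 'DELETE' END,
          COALESCE(:NEW.id, :OLD.id));
END;
/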
The simplest option would be to define materialized views in the various slave databases that replicate just the data that you want. So, for example, if there is a table A in the master database, then in slave database 1, you'd create a materialized view
CREATE MATERIALIZED VIEW a
<<refresh criteria>>
AS
SELECT a.*
FROM a@to_master a,
     dataOwners@to_master dm
WHERE a.dataOwnerID = dm.dataOwnerID
AND dm.some_column = <<some criteria that selects DataOwner2>>
while slave database 2 has a very similar materialized view
CREATE MATERIALIZED VIEW a
<<refresh criteria>>
AS
SELECT a.*
FROM a@to_master a,
     dataOwners@to_master dm
WHERE a.dataOwnerID = dm.dataOwnerID
AND dm.some_column = <<some criteria that selects DataOwner1 & 3>>
Of course, if the dataOwnerID can be hard-coded, you could simplify things and avoid doing the join. I'm guessing, though, that there is some column in the DataOwners table that identifies which slave a particular owner should be replicated to.
Assuming that you want only incremental changes to be replicated, you'd need to create some materialized view logs on the base tables in the master database. And you would probably want to configure refresh groups on the slave databases so that all the materialized views would refresh at the same time and would be transactionally consistent with each other.
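A hedged sketch of those two pieces (object and group names are hypothetical; a fast-refreshable join MV generally also needs ROWID in the logs):

-- on the master: materialized view logs so the slaves can refresh incrementally
CREATE MATERIALIZED VIEW LOG ON a WITH PRIMARY KEY, ROWID;
CREATE MATERIALIZED VIEW LOG ON dataOwners WITH PRIMARY KEY, ROWID;

-- on each slave: a refresh group keeps all MVs transactionally consistent
BEGIN
  DBMS_REFRESH.MAKE(
    name      => 'slave1_refresh_grp',    -- hypothetical group name
    list      => 'A',                     -- list every materialized view of this slave
    next_date => SYSDATE,
    interval  => 'SYSDATE + 1/24');       -- e.g. refresh hourly
END;
/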
Oracle GoldenGate software can do all of these tasks. Inserts/updates/deletes are applied in the same order as on the master DB, so it avoids foreign key and other constraint issues.
The MasterDatabase Extract generates a trail file, and the data is then split out to DBs 1, 2, 3 and 4.
It can also do multi-way replication, i.e. DB 1 can send data back to the master DB.
Besides GoldenGate, triggers may be your other option, but they require some programming.
