How to safely update hive external table - hadoop

I have an external hive table and I would like to refresh the data files on a daily basis. What is the recommended way to do this?
If I just overwrite the files, and if we are unlucky enough to have some other hive queries to execute in parallel against this table, what will happen to those queries? Will they just fail? Or will my HDFS operations fail? Or will they block until the queries complete?

If availability is a concern and space isn't an issue, you can do the following:
Make a synonym for the external table. Make sure all queries use this synonym when accessing the table.
When loading new data, load it to a new table with a different name.
When the load is complete, point the synonym to the newly loaded table.
After an appropriate length of time (long enough for any running queries to finish), drop the previous table.

First of all.. if you are accessing any table it may have two types of locks:
exclusive(if data is getting added) and shared(if data is getting read)..
so if you insert overwrite and add data into the table then at that time if you access the table with other queries, they wont get executed because there will be an exclusive lock on it and once the insert overwrite query completes then you may access the table.
Please refer to the following link:
https://cwiki.apache.org/confluence/display/Hive/Locking

Related

Oracle Data Integrator- ODI 12.2.1--Loadplan Issue no of records count issue

I come across a scenario in my project.I am loading data from file to Table using ODI.I am running My interfaces through loadplan.I've 1000 Records in my source file,and also getting 1000 records in target file.but when I'm checking ODI loadplan execution log its showing number of insert is 2000.can anyone please help.or is it a ODI bug.?
The number of inserts does not only show the inserts in the target table but also all the insert happening in temporary tables. Depending on the knowledge modules (KMs) used in an interface, ODI might load data in a C$_ table (LKM) or I$_ table (IKM/CKM). The rows loaded in these table will also be counted.
You can look at the code generated in the operator to check if your KMs are using using these temporary. You can also simulate an execution to see the code generated.

Dynamic Audit Trigger

I want to keep logs of all tables into 1 single log table. Suppose if any DML operation is going on any table inside DB. Than that should be logged in 1 single tables.
But there should be a dynamic trigger which will not hard coded the column names for every table.
Is there any solution for this.
Regards,
Somdutt Harihar
"Is there any solution for this"
No. This is not how databases work. Strongly enforced data structures is what they do, and that applies to audit tables just as much as transaction tables.
The reason is quite clear: the time you save not writing audit code specific to each transactional table is the time you will spend writing a query to retrieve the audit records. The difference is, when you're trying to get the audit records out you will have your boss standing over your shoulder demanding to know when you can tell them what happened to the payroll records last month. Or asking how long it will take you to produce that report for the regulators, are you trying to make the company look like a bunch of clowns? You get the picture. This is not where you want to be.
Also, the performance of a single table to store all the changes to all the tables in the database? That is going to be so slow, you have no idea.
The point is, we can generate the auditing code. It is easy to write some SQL which interrogates the data dictionary and produces DDL for the target tables and triggers to populate those tables.
In fact it gets even easier in 11.2.0.4 and later because we can use FLASHBACK DATA ARCHIVE (formerly Oracle Total Recall) to build and maintain such journalling functionality automatically, and query it automatically with the as of syntax. Find out more.
Okay, so technically there is a solution. You could have a trigger on each table which executes some dynamic PL/SQL to interrogate the data dictionary and assembles a piece of JSON which you stuff into your single table. The single table could be partitioned by day range and sub-partitioned by table name (assuming you have licensed the Partitioning option) to mitigate the performance of querying it.
But that is extremely complex. Running dynamic PL/SQL for every DML statement will have a bad effect on performance, which the users will notice. And this still doesn't solve the fundamental problem of retrieving the audit trail when you need it.
To audit DML actions on any table just enable such audit by using following code:
audit insert table, update table, delete table;
All actions with tables will then be logged to sys.dba_audit_object table.
Audit will only log timestamp, user, host and other params, not exact copies of new or old rows.

What is the use of disable operation in hbase?

I know it is to disallow anyone from performing any operation on a table, when a schema change is going to be made.
> disable ‘table_name’
But I want more clarification on it. Why should we disallow others to perform any operation on it? Is it just because wrong and unexpected results would be given when a query is made while a schema change is undergoing...!
HBase is a strictly consistent NoSQL database in case of reads and writes.
So achieving consistency is very important for HBase during DB operations.
HBase demands disabling table in case of altering schema changes and dropping tables.
HBase doesn't have a protocol to tell all the regions to update the schema changes online. So we need to disable the table before alter it.
HBase table drop is two step procedure:
Closing all the regions. i.e disable the table
Dropping them. i.e drop the table.
So We must disable all operations except a few operations like list, is_enabled, is_disabled etc... on the table before dropping it.

Hive reading while inserting

What would happen if I'm trying to read from a hive table , while concurrently there is someone inserting into the hive table.
Does it lock the files while querying into it, or does it do a dirty read???
Hadoop is meant for parallel proceesing. So in a cluster parallel querying can be done on a hive table.
While some data is being inserted in the table if another user queries the same table, the files are not locked, rather a job is accepted is put to execution.
Now if the new data insert is successful before the second query
is processed the the result of the second query will
acknowledge the inserted data.
Note: In most cases data insertion takes lesser time than
querying a table because while querying MR jobs are created and
are run in backend

How to update/insert a table without creating a new table (temporary or otherwise)

Background: My team has an etl job that updates an aggregate table. Each row contains data for a particular date, but this row can and will get updated after the row date (which means any row can contain data from multiple jobs). This ETL job missed some data for one day last week and now I need to backfill it.
Problem: I have the missing data, and what I was planning on doing was dumping that data into a temporary table and then merging it with the agg table. That way I can deal with whether the ETL job already contains a row for that data (update) or whether a new row needs to be added (insert), but I don't have sufficient permissions to create a temp table, and I'd prefer not to involve the DBA.
Question: Can I do an insert/update sort of behavior without creating a temporary table (this is Oracle SQL by the way).
Edit: The data is coming from a tsv file.
Why do you want to avoid involving the DBA? The DBA should have full knowledge of what's going on in the database, as they are ultimately responsible for the condition of the data within it. So you shouldn't be playing sneaky commando with them.
As you have a file of missing data, the easiest way to present it to the database is with an external table. This requires the creation of the table and probably a directory object as well. You will need the DBA's help with this task.
The only way to avoid creating database objects is to convert your TSV file into a series of DML statements. An IDE which supports regex and/or records macros will prove invaluable here. I like TextPad; other editors are available.
The DML statement for doing upserts in Oracle is the MERGE statement. The one thing you need to watch for is recency. Your missing data comes from last week. If a row exists it may have have been added or amended in the intervening period. You must write your MERGE statement so it does not overwrite more recent data with the older stuff. Hopefully your table has useful metadata columns such as DATE_CREATED and LAST_UPDATED.

Resources