I’m brand new to the big data ecosystem, but I have good SQL knowledge and have worked only with relational databases. There is a scenario in my case: we have a table in Hive which records error details from the log. My requirement is that whenever data is inserted into the error log table, the system should trigger an alert mail. I’m looking for a kind of “database trigger”. I know a trigger is not possible on a Hive table since it is a warehouse table. My question is: is there any workaround to achieve this?
I propose that you use Elasticsearch instead for this need; with Watcher or X-Pack you can generate alerts. Hive is not the right technology for your needs here.
Is there a way to get the list of all tables with the last refresh date from a database in Cloudera Hadoop Impala?
I'm trying to write a custom SQL query that can do that, so I can use it to build a dashboard (in Tableau) where we can track whether a table has been refreshed or not and take action accordingly. I tried it using a join, but there are so many tables that I believe there is a better way to do it. (The database name is Core_research and there are more than 500 tables.)
I used to run a script that refreshed column stats on tables every Sunday. We couldn't run all the tables, but we did as many as time permitted. You could do the same, but actually record in a database table when the script ran. This would give you the functionality you are looking for.
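If you go the record-keeping route, a minimal Impala sketch (the refresh_log table and the placeholder table name are hypothetical) could look like this:

    -- one-time setup: a small tracking table
    CREATE TABLE IF NOT EXISTS core_research.refresh_log (
      table_name   STRING,
      refreshed_at TIMESTAMP
    );

    -- at the end of the refresh script, record each table that was processed
    INSERT INTO core_research.refresh_log
    VALUES ('core_research.some_table', now());   -- placeholder table name

    -- the Tableau dashboard can then read the latest refresh per table
    SELECT table_name, MAX(refreshed_at) AS last_refresh
    FROM core_research.refresh_log
    GROUP BY table_name;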
Another option would be to create a table out of the Impala logs and keep track of things that way (with some fancy regex to track refreshes).
We have transaction tables in Oracle, and for reporting purposes we need this data transferred in real time to a flat Oracle table in another database. The performance of the report is great with the data placed in this flat table.
Currently we are using GoldenGate for replication to the other database and a materialized view for this, but due to some problems we need to switch to some other way of populating/maintaining this flat table. What options do we have?
It is a pretty basic requirement, but the solutions I can see are for batch processing. Please also suggest any other solutions you feel would better serve this purpose. Changing the target database to something else is also an option, as there might be more such reports coming ahead.
I am new to Informatica BDM. I have a use case in which I have to import data incrementally (100 tables) from an RDBMS into Hive on a daily basis. Can someone please guide me on the best possible approach to achieve this?
Thanks,
Sumit
Hadoop follows a write once, read many (WORM) approach, and incremental loads are not easy. Here are some guidelines you can follow and validate against your current requirement:
If the table is small or mid-size and does not have too many records, it is better to refresh the entire table.
If the table is too big and the incremental load has add/update/delete operations, you can think of staging the delta and performing a join operation to re-create the data set.
For a large table and a large delta, you can create a version number for all the latest records; each delta can land in a new directory, and a view should be created to return the latest version for further processing. This avoids a heavy merge operation (see the sketch after this list).
If delete operations do not come through as changes, you also need to think about how to act on them; in such a case, you need to do a full refresh.
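For the version-number approach above, here is a minimal HiveQL sketch (the table, view, and column names are hypothetical): each delta lands in a new load_version partition, and a view hides everything but the latest version of each record.

    -- history table keeps every version of every record
    CREATE TABLE customer_history (
      customer_id BIGINT,
      name        STRING,
      updated_at  TIMESTAMP
    )
    PARTITIONED BY (load_version INT)
    STORED AS ORC;

    -- view that returns only the latest version per key,
    -- so downstream jobs never need a heavy merge
    CREATE VIEW customer_latest AS
    SELECT customer_id, name, updated_at
    FROM (
      SELECT h.*,
             ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY load_version DESC) AS rn
      FROM customer_history h
    ) t
    WHERE rn = 1;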
I have a question on Hive view partitions.
I have a base table which is partitioned on a date field. My view is a simple view which does a select * from the base table.
My question is: would the view be partition aware when it is queried by the end user, or do I need to execute any other commands to be able to use the partitions through the view?
I am asking this because of the following statement on this topic at https://cwiki.apache.org/confluence/display/Hive/PartitionedView:
1. One possible approach mentioned in HIVE-1079 is to infer view partitions automatically based on the partitions of the underlying tables. A command such as SHOW PARTITIONS could then synthesize virtual partition descriptors on the fly. This is fairly easy to do for use case #1, but potentially very difficult for use cases #2 and #3. So for now, we are punting on this approach.
Regards,
Nish
At my prior engagement we used views extensively and all of our tables were partitioned. We relied on the ability of the Hive query planner to perform proper partition pruning through these views, and it did so successfully. In fact, there were several edge cases/complicated scenarios that required updates to the Hive source code by Hortonworks, but in the general/simpler cases the partition pruning was working.
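If you want to confirm this on your own cluster, a quick check (with hypothetical table, view, and partition names) is to run EXPLAIN through the view and verify that only the requested partition is scanned:

    CREATE TABLE base_table (id BIGINT, val STRING)
    PARTITIONED BY (dt STRING);

    CREATE VIEW base_view AS SELECT * FROM base_table;

    -- the predicate on the partition column is pushed down to the base table,
    -- so the plan should show a single partition being read
    EXPLAIN SELECT * FROM base_view WHERE dt = '2020-01-01';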
I need some help with auditing in Oracle. We have a database with many tables, and we want to be able to audit every change made to any table in any field. So the things we want to have in this audit are:
the user who made the change
the time the change occurred
the old value and the new value
So we started creating a trigger which was supposed to perform the audit for any table, but then ran into issues...
As I mentioned before, we have so many tables that we cannot go about creating a trigger for each one. So the idea is to create a master trigger that behaves dynamically for any table that fires it. I tried to do it but had no luck at all; it seems that Oracle restricts a trigger to the single table declared in its code, rather than binding dynamically the way we want.
Do you have any idea on how to do this or any other advice for solving this issue?
If you have 10g Enterprise Edition you should look at Oracle's Fine-Grained Auditing. It is definitely better than rolling your own.
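A minimal FGA sketch (the schema, table, and policy names are hypothetical) looks roughly like this; captured statements end up in DBA_FGA_AUDIT_TRAIL:

    -- register a fine-grained auditing policy on one application table
    BEGIN
      DBMS_FGA.ADD_POLICY(
        object_schema   => 'APP',
        object_name     => 'ORDERS',
        policy_name     => 'ORDERS_DML_AUDIT',
        statement_types => 'INSERT,UPDATE,DELETE',
        audit_trail     => DBMS_FGA.DB_EXTENDED  -- capture SQL text and binds
      );
    END;
    /

    -- later, review what was captured
    SELECT db_user, timestamp, statement_type, sql_text
    FROM   dba_fga_audit_trail
    WHERE  object_name = 'ORDERS';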
But if you have a lesser version or for some reason FGA is not to your taste, here is how to do it. The key thing is: build a separate audit table for each application table.
I know this is not what you want to hear because it doesn't match the table structure you outlined above. But storing a row with OLD and NEW values for each column affected by an update is a really bad idea:
It doesn't scale (a single update touching ten columns spawns ten inserts)
What about when you insert a record?
It is a complete pain to assemble the state of a record at any given time
So, have an audit table for each application table, with an identical structure. That means including the CHANGED_TIMESTAMP and CHANGED_USER on the application table, but that is not a bad thing.
Finally, and you know where this is leading, have a trigger on each table which inserts a whole record with just the :NEW values into the audit table. The trigger should fire on INSERT and UPDATE. This gives the complete history, and it is easy enough to diff two versions of the record. For a DELETE you will insert an audit record with just the primary key populated and all other columns empty.
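As a sketch, assuming a hypothetical application table EMP (with the CHANGED_USER and CHANGED_TIMESTAMP columns already on it, as suggested above) and an identically structured audit table EMP_AUDIT:

    -- audit table with an identical structure (note: CTAS copies NOT NULL
    -- constraints, which you may want to drop on the audit table)
    CREATE TABLE emp_audit AS SELECT * FROM emp WHERE 1 = 0;

    CREATE OR REPLACE TRIGGER emp_audit_trg
      AFTER INSERT OR UPDATE OR DELETE ON emp
      FOR EACH ROW
    BEGIN
      IF DELETING THEN
        -- for a delete, record just the primary key
        INSERT INTO emp_audit (empno) VALUES (:OLD.empno);
      ELSE
        -- for insert/update, copy the whole :NEW row
        INSERT INTO emp_audit (empno, ename, sal, changed_user, changed_timestamp)
        VALUES (:NEW.empno, :NEW.ename, :NEW.sal,
                :NEW.changed_user, :NEW.changed_timestamp);
      END IF;
    END;
    /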
Your objection will be that you have too many tables and too many columns to implement all these objects. But it is simple enough to generate the table and trigger DDL statements from the data dictionary (user_tables, user_tab_columns).
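For example, the audit-table DDL can be generated in one pass over the dictionary (the _AUDIT suffix is just a convention for this sketch), and the trigger bodies can be built the same way from user_tab_columns:

    -- spool the output of this query and run it to create one audit table
    -- per application table, each with an identical structure
    SELECT 'CREATE TABLE ' || table_name || '_AUDIT AS SELECT * FROM '
           || table_name || ' WHERE 1 = 0;' AS ddl
    FROM   user_tables
    ORDER  BY table_name;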
You don't need to write your own triggers.
Oracle ships with flexible and fine-grained audit trail services. Have a look at this document (9i) as a starting point.
(Edit: Here's a link for 10g and 11g versions of the same document.)
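With classic auditing (the AUDIT_TRAIL=DB initialization parameter set), the per-table setup is a one-liner and the results land in DBA_AUDIT_TRAIL; here is a sketch with a hypothetical table:

    -- audit every DML statement against one application table
    AUDIT INSERT, UPDATE, DELETE ON app.orders BY ACCESS;

    -- review what has been captured (this records who did what and when,
    -- not the old/new column values)
    SELECT username, timestamp, action_name, obj_name
    FROM   dba_audit_trail
    WHERE  owner = 'APP'
    ORDER  BY timestamp DESC;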
You can audit so much that it can be like drinking from the firehose - and that can hurt the server performance at some point, or could leave you with so much audit information that you won't be able to extract meaningful information from it quickly, and/or you could end up eating up lots of disk space. Spend some time thinking about how much audit information you really need, and how long you might need to keep it around. To do so might require starting with a basic configuration, and then tailoring it down after you're able to get a sample of the kind of volume of audit trail data you're actually collecting.