Is there a way to get the list of all tables with their last refresh date from a database in Cloudera Hadoop Impala?
I'm trying to write a custom SQL query that can do that, so I can use it to build a dashboard (in Tableau) where we can track whether a table has been refreshed or not and take action accordingly. I tried it using a join, but there are so many tables that I believe there is a better way to do it. (The database name is Core_research and it has more than 500 tables.)
I used to run a script that refreshed column stats on tables every Sunday. We couldn't cover all the tables, but we did as many as time permitted. You could do the same, but also record when the script ran in a database/table. That would give you the functionality you are looking for.
Another option would be to create a table out of the Impala logs and keep track of things that way (with some fancy regex to track refreshes).
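A rough sketch of the first option (refreshing stats and recording each run in a tracking table the Tableau dashboard can read), assuming the impyla client and a hypothetical core_research.refresh_log table:

# Sketch: refresh a set of tables and record each run in a tracking table.
# Assumes the impyla package and a pre-created table
# core_research.refresh_log (table_name STRING, refreshed_at TIMESTAMP);
# host, port and the table list are placeholders.
from impala.dbapi import connect

TABLES = ["fact_sales", "dim_customer"]  # hypothetical table names

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()
for table in TABLES:
    cur.execute("REFRESH core_research.{0}".format(table))
    cur.execute("COMPUTE INCREMENTAL STATS core_research.{0}".format(table))
    cur.execute(
        "INSERT INTO core_research.refresh_log "
        "VALUES ('{0}', now())".format(table)
    )
cur.close()
conn.close()

The dashboard then only needs something like SELECT table_name, MAX(refreshed_at) FROM core_research.refresh_log GROUP BY table_name.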
I am new to Informatica BDM. I have a use case in which I have to import data incrementally (100 tables) from an RDBMS into Hive on a daily basis. Can someone please guide me to the best possible approach to achieve this?
Thanks,
Sumit
Hadoop follows a write once, read many (WORM) approach, and incremental loading is not easy. Here are some guidelines you can follow and validate against your current requirements:
If the table is small/mid-size and does not have too many records, it is better to refresh the entire table.
If the table is too big and the incremental load has add/update/delete operations, you can consider staging the delta and performing a join operation to re-create the data set.
For a large table and a large delta, you can attach a version number to every record, land each delta in a new directory, and create a view that returns only the latest version of each record for further processing (see the sketch after this list). This avoids a heavy merge operation.
If deletes do not arrive as change records, you also need to think about how to act on them; in that case you need a full refresh.
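A minimal sketch of the versioned-view idea from point 3, assuming the PyHive client and hypothetical table and column names (orders_history keyed by order_id, with a load_version column that increases with every delta):

# Sketch of point 3: every delta load carries a higher load_version, and a
# view exposes only the latest version of each record.
# Assumes the PyHive package; connection details and names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-host.example.com", port=10000)
cur = conn.cursor()
cur.execute("""
CREATE VIEW IF NOT EXISTS db.orders_latest AS
SELECT order_id, amount, status, load_version
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY load_version DESC) AS rn
    FROM db.orders_history o
) ranked
WHERE rn = 1
""")
cur.close()
conn.close()

Downstream processing reads db.orders_latest, so the old and new data never have to be physically merged.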
I have an issue in a production environment: one of the workflows has been running for more than one day, inserting records into a SQL Server DB. It is just a direct load mapping, and there is no SQ override either. The monitor shows the SQ count as 7 million and the same number of records being inserted into the target, but the source DB shows only around 3 million records. How can this be possible?
Have you checked whether the source qualifier is joining more than one table? A screenshot of the affected mapping pipeline and an obfuscated log file would help.
Another thought: given your job ran for a day, were any jobs run in that time to purge old records from the source table?
Cases where I have seen this kind of thing happen:
There's a SQL query override doing something different than I thought (e.g. joining some tables).
I'm looking at a different source: verify the connections and make sure you check the same object in the same database on the same server that PowerCenter is connecting to.
It's a reusable session being executed multiple times by different workflows. In such a case, the Source/Target statistics in the Workflow Monitor may refer to another execution.
I wanted some advice on how to deal with table operations (renaming a column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date, e.g. if I have a table named fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename-column wrapper generates aliased queries, i.e. if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I do a batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
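For reference, a minimal sketch of what one such per-partition rename job can look like with the google-cloud-bigquery client (project, dataset and table names are placeholders):

# Sketch: query one date shard with aliased columns and write the result
# back over the same table with WRITE_TRUNCATE.
# Assumes the google-cloud-bigquery package; names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

def rename_partition(suffix):
    table_id = "my-project.my_dataset.fact_{0}".format(suffix)
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(table_id),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = ("SELECT address AS location, name AS firstname, city AS town "
           "FROM `{0}`".format(table_id))
    return client.query(sql, job_config=job_config)  # returns a QueryJob

job = rename_partition("20160301")
job.result()  # block until this shard has been rewritten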
My main issue lies with the fact that BigQuery only supports 50 concurrent jobs. This means that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 slots for ETL jobs that are running.
Also, I haven't found a way to do a poll_job on a batch operation to see whether or not all jobs in the batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
I can propose two options:
Using View
Creating views is very simple to script out and execute; it is fast and free, compared with the cost of scanning the whole table with the select-into approach.
You can create the view using the Tables: insert API with the type property set appropriately.
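For illustration, a hedged sketch of the view option with the google-cloud-bigquery client, which issues the same Tables: insert call underneath (names are placeholders):

# Sketch of option 1: a view that exposes the renamed columns without
# rewriting or re-scanning the underlying shard.
# Assumes the google-cloud-bigquery package; names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

view = bigquery.Table("my-project.my_dataset.fact_20160301_renamed")
view.view_query = ("SELECT address AS location, name AS firstname, "
                   "city AS town FROM `my-project.my_dataset.fact_20160301`")
client.create_table(view)  # creates a table of type VIEW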
Using Jobs: insert EXTRACT and then LOAD
Here you can extract the table to GCS and then load it back into GBQ with an adjusted schema.
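And a rough sketch of the second option, again with the google-cloud-bigquery client (bucket, project and table names are placeholders; the load step simply supplies the new column names in the schema):

# Sketch of option 2: EXTRACT the shard to GCS, then LOAD it back with the
# adjusted schema. Assumes the google-cloud-bigquery package.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.my_dataset.fact_20160301"
gcs_uri = "gs://my-staging-bucket/fact_20160301-*.csv"

# 1) extract the shard to GCS (CSV with a header row by default)
client.extract_table(table_id, gcs_uri).result()

# 2) load it back, renaming the columns via the supplied schema
load_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("location", "STRING"),
        bigquery.SchemaField("firstname", "STRING"),
        bigquery.SchemaField("town", "STRING"),
    ],
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the exported header row
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(gcs_uri, table_id, job_config=load_config).result()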
The above approach will a) eliminate the cost of querying (scanning) the tables and b) can help with the concurrency limitations, but whether it does depends on the actual volume of the tables and other requirements you might have.
The best way to manipulate a schema is through the Google BigQuery API.
Use the Tables: get API to retrieve the existing schema for your table: https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns etc.
Then, using the API again, perform an update on the schema, setting it to your newly modified version. This should all occur in one job: https://cloud.google.com/bigquery/docs/reference/v2/tables/update
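A sketch of that get/modify/update flow through the google-api-python-client discovery client, which wraps the v2 REST endpoints linked above (the rename mapping and table coordinates are placeholders):

# Sketch: fetch the table resource, rename the fields in its schema, and
# send the modified resource back with Tables: update.
# Assumes google-api-python-client with application-default credentials.
from googleapiclient.discovery import build

service = build("bigquery", "v2")
coords = dict(projectId="my-project", datasetId="my_dataset",
              tableId="fact_20160301")
renames = {"address": "location", "name": "firstname", "city": "town"}

# Tables: get -> the existing table resource, including its schema
table = service.tables().get(**coords).execute()
for field in table["schema"]["fields"]:
    field["name"] = renames.get(field["name"], field["name"])

# Tables: update with the modified schema
service.tables().update(body=table, **coords).execute()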
I'm using BigQuery to analyse logs from my website.
There is some simple data which I'm extracting on a weekly basis using a simple SQL query, i.e.
SELECT a, b, c FROM table WHERE dates are in week 1
I would like to set up a process that gets this data into a dataset automatically at the end of each week, so I don't have to run the query every week, and that stores the results, so I don't have to run a query against a lot of history if I need to see it again.
What would you advise for this process?
I'd say look into programming a cron job (Python, Java) to do it for you.
Considering your use case is pretty simple, it shouldn't be too complicated to set up.
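For example, a small script along these lines can be scheduled from crontab (a sketch assuming the google-cloud-bigquery client; the project, dataset, table names and the query itself are placeholders):

# Sketch: run the weekly query and store the result in its own dated table,
# e.g. weekly_stats_YYYYMMDD, so the history never has to be re-scanned.
# Crontab entry, e.g. every Monday at 02:00:  0 2 * * 1 python weekly_export.py
from datetime import date
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
destination = "my-project.reports.weekly_stats_{0}".format(
    date.today().strftime("%Y%m%d"))

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(destination),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
SELECT a, b, c
FROM `my-project.logs.events`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
                     AND CURRENT_DATE()
"""
client.query(sql, job_config=job_config).result()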
We have an external Hive table that is used for processing raw log file data. The files are hourly, and are partitioned by date and source host name.
At the moment we are importing files using simple python scripts that are triggered a few times per hour. The script creates sub folders on HDFS as needed, copies new files from the temporary local storage and adds any new partitions to Hive.
Today, new partitions are created using "ALTER TABLE ... ADD PARTITION ...". However, if another Hive query is running on the table it will be locked, which means that the add partition command will fail (if the query runs for long enough) since it requires an exclusive lock.
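Roughly, each import run does something like this (a sketch with placeholder paths, host names and table name, using PyHive for the DDL):

# Sketch of one import run: create the HDFS directory, copy the new file in,
# and register the partition in Hive. All names and paths are placeholders.
import subprocess
from pyhive import hive

dt, src_host = "2016-03-01", "web01"
hdfs_dir = "/data/rawlogs/dt={0}/host={1}".format(dt, src_host)
local_file = "/tmp/staging/web01-2016030100.log"

subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir])

cur = hive.Connection(host="hive-host.example.com", port=10000).cursor()
cur.execute(
    "ALTER TABLE rawlogs ADD IF NOT EXISTS PARTITION "
    "(dt='{0}', host='{1}') LOCATION '{2}'".format(dt, src_host, hdfs_dir)
)
cur.close()

It is that last ALTER TABLE statement that fails when a long-running query holds a lock on the table.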
An alternative to this approach would be to use "MSCK REPAIR TABLE", which for some reason does not seem to acquire any locks on the table. However, I have gotten the impression that using repair table is not recommended for a production setting.
What is the best practise for adding Hive partitions programmatically in a concurrent environment?
What are the risks or disadvantages of using MSCK REPAIR TABLE?
Is there an explanation for the seemingly inconsistent locking behaviour of the two partition adding commands? I.e. do they have different effects on running queries?
Not a good answer, but we have the same issue, and here are our findings:
In the Hive doc, https://cwiki.apache.org/confluence/display/Hive/Locking, the locks seem pretty sensible: an ADD PARTITION will request an exclusive lock on the created partition and a shared lock on the whole table, while a SELECT query will request a shared lock on the table. So it should be fine.
However, it does not work this way, at least in CDH 5.3. According to this thread, https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/u7aM9W3pegM, this is a known behaviour, probably new (I am not sure, but like the author of that thread I think the issue was not there on CDH 4.7).
So basically, we're still thinking about our partition strategy, but we will probably try to create all possible partitions in advance (before getting the data), as we know the values of all future partitions precisely (which might not be the case for you).
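A sketch of that pre-creation idea, assuming the partition keys (date and source host) are known ahead of time and using PyHive (table, host list and connection details are placeholders):

# Sketch: create tomorrow's partitions up front, before any data arrives,
# so no ADD PARTITION has to run while queries hold locks on the table.
from datetime import date, timedelta
from pyhive import hive

HOSTS = ["web01", "web02", "web03"]  # hypothetical source hosts
tomorrow = (date.today() + timedelta(days=1)).strftime("%Y-%m-%d")

cur = hive.Connection(host="hive-host.example.com", port=10000).cursor()
for h in HOSTS:
    cur.execute(
        "ALTER TABLE rawlogs ADD IF NOT EXISTS PARTITION "
        "(dt='{0}', host='{1}')".format(tomorrow, h)
    )
cur.close()

Scheduled off-peak (e.g. just after midnight), the ADD PARTITION statements run before any long-lived queries can take a lock on the table.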