Find a table across databases in Hadoop

I would like to find a specific table across multiple databases in Hadoop. I'm looking for an automatic solution since there are dozens of databases involved.
Is there a hive command that could help me to do this?
Or I have to write something in bash instead?
Thanks

You can simply query your metastore. In my case the metastore is MySQL, so I did it this way:
Connect to your metastore, e.g. mysql -uUser -hHost -pPassword
Switch to your metastore db, e.g. use metastoredb;
select * from TBLS where TBL_NAME='table_name';
I queried only three columns, like this:
select TBL_ID,DB_ID,TBL_NAME from TBLS where TBL_NAME='ri_reg_datamodels_tmp';
Ask if you run into any issues with it.
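TBLS only gives you a DB_ID, so to see the database names you would join it to the metastore's DBS table. The sketch below shows that join in Python. It runs against an in-memory SQLite stand-in that contains only the columns used here (an assumption made so the example is self-contained); the sample databases and tables are made up, and `find_table` is a hypothetical helper name.

```python
import sqlite3

def find_table(conn, table_name):
    """Return (database, table) pairs for every database that holds table_name."""
    query = (
        "SELECT d.NAME, t.TBL_NAME "
        "FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID "
        "WHERE t.TBL_NAME = ? "
        "ORDER BY d.NAME"
    )
    return conn.execute(query, (table_name,)).fetchall()

# In-memory stand-in for the metastore, reduced to the columns used above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT);
INSERT INTO DBS VALUES (1, 'sales');
INSERT INTO DBS VALUES (2, 'staging');
INSERT INTO TBLS VALUES (10, 1, 'orders');
INSERT INTO TBLS VALUES (11, 2, 'orders');
INSERT INTO TBLS VALUES (12, 2, 'events');
""")

for db_name, tbl_name in find_table(conn, "orders"):
    print(db_name + "." + tbl_name)
```

Against a real MySQL metastore the same SELECT should work from the mysql prompt; a Python MySQL driver would typically want `%s` instead of `?` as the placeholder.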

Related

Can Datagrip do cross DB queries/joins?

I'm looking to use DG to join data from Snowflake and Impala or Hadoop. I can't see how I can do this, as the job is sent to the DB to run the query. Are there any workarounds, or a way to hold the data/table in memory to use across sessions?
Thanks
The workaround is to create tables from the result/tables of the other databases. And then JOIN with these tables.
Please see: https://www.jetbrains.com/help/datagrip/tables-copy.html

Hive: get the data location using a one-liner

I wonder if there is a way to get the data location from hive using a one-liner. Something like
select d.location from ( describe formatted table_name partition ( .. ) ) as d;
My current solution is to get the full output and then parse it.
Unlike a traditional RDBMS, Hive stores its metadata in a separate database. In most cases this is MySQL or Postgres. The metastore database details can be found in hive-site.xml. If you have access to the metastore database, you can run SELECT on table TBLS to get the details about the tables, COLUMNS_V2 to get the details about columns, etc.
If you do not have access to the metastore, the only option is to describe each table to get the details. If you have a lot of databases and tables, you could write a shell script to get the list of tables using "show tables" and loop over the tables.
Two methods if you do not have access to the metastore:
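Once the DESCRIBE FORMATTED output for a table has been captured as text, picking out the location is a small parsing step. A minimal Python sketch follows, assuming the output contains a row whose first cell is "Location:" (tab-separated as in the Hive CLI, or pipe-delimited as in beeline). `extract_location` is a hypothetical helper, and the sample text is invented for illustration.

```python
import re

def extract_location(describe_output):
    """Return the value of the 'Location:' row, or None if there is none."""
    for line in describe_output.splitlines():
        # Split on tabs (Hive CLI) or pipes (beeline) and drop empty cells.
        cells = [c.strip() for c in re.split(r"[|\t]", line) if c.strip()]
        if len(cells) >= 2 and cells[0] == "Location:":
            return cells[1]
    return None

# Invented sample in the tab-separated Hive CLI layout.
sample = (
    "# Detailed Table Information\t \t \n"
    "Database:           \tdefault\t \n"
    "Location:           \thdfs://namenode:8020/warehouse/sales\t \n"
    "Table Type:         \tMANAGED_TABLE\t \n"
)
print(extract_location(sample))
```

The captured text could come from something like `hive -e "describe formatted db.table"` run once per table in the loop described above.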
Parse DESCRIBE TABLE in the shell like in this answer: https://stackoverflow.com/a/43804621/2700344
Also Hive has a virtual column INPUT__FILE__NAME.
select INPUT__FILE__NAME from table
will output the location URL of each file.
You can split the URL by '/', take the element you need, aggregate, etc.
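The split-and-aggregate step can also be done on the client side once the file URLs are fetched. A rough Python sketch, assuming the query results are already available as a list of strings; the URLs and helper names below are made up for illustration.

```python
from collections import Counter

def directory_of(file_url):
    """Strip the file name, leaving the directory that holds the file."""
    return file_url.rsplit("/", 1)[0]

def partition_values(file_url):
    """Collect key=value path segments, i.e. Hive-style partition directories."""
    pairs = (part.split("=", 1) for part in file_url.split("/") if "=" in part)
    return dict(pairs)

# Invented sample of INPUT__FILE__NAME values.
file_urls = [
    "hdfs://nn:8020/warehouse/t/dt=2020-01-01/000000_0",
    "hdfs://nn:8020/warehouse/t/dt=2020-01-01/000001_0",
    "hdfs://nn:8020/warehouse/t/dt=2020-01-02/000000_0",
]

# Number of files per directory.
files_per_dir = Counter(directory_of(u) for u in file_urls)
for directory, count in sorted(files_per_dir.items()):
    print(directory, count)
```

Doing the same thing inside Hive (with split() and a GROUP BY) avoids pulling one row per record to the client, so this client-side version mainly suits small tables or sampled results.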

How to transfer data & metadata from Hive to RDBMS

There are more than 300 tables in my hive environment.
I want to export all the tables from Hive to Oracle/MySql including metadata.
My Oracle database doesn't have any tables corresponding to these Hive tables.
Sqoop import from Oracle to Hive creates the table in Hive if it doesn't exist. But Sqoop export from Hive to Oracle doesn't create the table if it doesn't exist, and fails with an exception.
Is there any option in Sqoop to export metadata also? or
Is there any other Hadoop tool through which I can achieve this?
Thanks in advance
The feature you're asking for isn't in Sqoop. Unfortunately, I don't know of a current Hadoop tool that can do what you're asking either. A potential workaround is the "show create table mytable" statement in Hive, which returns the table's create table statement. You can parse this manually or programmatically (e.g. via awk), collect the create table statements in a file, then run this file against your Oracle db. From there, you can use sqoop to populate the tables.
It won't be fun.
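To give an idea of what the parsing step involves, here is a minimal Python sketch that turns the column lines of a SHOW CREATE TABLE result into an Oracle-style CREATE TABLE. The type mapping is an assumption chosen for illustration, not an official one; it handles only primitive types, and `hive_ddl_to_oracle` is a hypothetical helper. Complex types (array, map, struct) are rejected rather than guessed.

```python
import re

# Assumed Hive -> Oracle mapping for illustration; adjust to your own data.
HIVE_TO_ORACLE = {
    "tinyint": "NUMBER(3)",
    "smallint": "NUMBER(5)",
    "int": "NUMBER(10)",
    "bigint": "NUMBER(19)",
    "float": "BINARY_FLOAT",
    "double": "BINARY_DOUBLE",
    "boolean": "NUMBER(1)",
    "string": "VARCHAR2(4000)",
    "date": "DATE",
    "timestamp": "TIMESTAMP",
}

# Matches lines such as "  `id` bigint," or "  `amount` decimal(12,2))".
COLUMN_LINE = re.compile(r"^\s*`(\w+)`\s+([a-z]+)(\(\d+(?:,\s*\d+)?\))?", re.IGNORECASE)

def oracle_type(base, args):
    base = base.lower()
    if base == "decimal":
        return "NUMBER" + (args or "(10,0)")  # Hive's default is decimal(10,0)
    if base == "varchar":
        return "VARCHAR2" + (args or "(255)")
    if base == "char":
        return "CHAR" + (args or "(255)")
    if base not in HIVE_TO_ORACLE:
        raise ValueError("no mapping for Hive type: " + base)
    return HIVE_TO_ORACLE[base]

def hive_ddl_to_oracle(table_name, show_create_output):
    """Build a CREATE TABLE statement from the backtick-quoted column lines."""
    columns = []
    for line in show_create_output.splitlines():
        match = COLUMN_LINE.match(line)
        if match:
            name, base, args = match.groups()
            columns.append("  " + name + " " + oracle_type(base, args))
    if not columns:
        raise ValueError("no column definitions found")
    return "CREATE TABLE " + table_name + " (\n" + ",\n".join(columns) + "\n)"

# Invented sample of SHOW CREATE TABLE output.
sample_ddl = (
    "CREATE TABLE `customers`(\n"
    "  `id` bigint, \n"
    "  `name` string, \n"
    "  `balance` decimal(12,2))\n"
    "ROW FORMAT SERDE \n"
    "  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\n"
)
print(hive_ddl_to_oracle("customers", sample_ddl))
```

Partition columns listed under PARTITIONED BY match the same pattern, so they end up as ordinary columns in the generated statement. Column comments, constraints, and reserved-word clashes are not handled.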
Sqoop can't copy metadata or create a table in the RDBMS on the basis of a Hive table.
The table must already exist in the RDBMS to perform a sqoop export.
Why is that?
Mapping from an RDBMS to Hive is easy because Hive has only a few datatypes (10-15), so mapping the many RDBMS datatypes onto Hive datatypes is easily achievable. The reverse is not that easy: a typical RDBMS has hundreds of datatypes, and they differ from one RDBMS to another.
Also, sqoop export is a newly added feature. This capability may come in the future.

Presto and hive partition discovery

I'm using Presto mainly with the Hive connector to connect to the Hive metastore.
All of my tables are external tables pointing to data stored in S3.
My main issue with this is that there is no way (at least none I'm aware of) to do partition discovery in Presto, so before I start querying a table in Presto I need to switch to Hive and run msck repair table mytable
Is there a more reasonable way to do it in Presto?
I'm on version 0.227 and the following helps me:
select * from hive.yourschema."yourtable$partitions"
This select returns all the partitions mapped in your catalog. You can filter, order, etc. as a normal query would.
No.
If the Hive metastore doesn't see the partitions, PrestoDB will not see them.
Maybe a cron job can help you.
There is now a way to do this:
CALL system.sync_partition_metadata(schema_name=>'<your-schema>', table_name=>'<your-table>', mode=>'FULL')
Credit to this post and this video
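If the sync has to run regularly for several tables (for example from a cron job), the CALL statements can be generated rather than typed by hand. A small Python sketch that only builds the statement text, assuming the three modes ADD, DROP and FULL; the function name and the table list are hypothetical, and executing the statements is left to whatever Presto client you already use.

```python
VALID_MODES = ("ADD", "DROP", "FULL")

def quote(value):
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def sync_partition_call(schema, table, mode="FULL"):
    """Build the CALL statement for one table."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of " + ", ".join(VALID_MODES))
    return (
        "CALL system.sync_partition_metadata("
        "schema_name=>" + quote(schema) + ", "
        "table_name=>" + quote(table) + ", "
        "mode=>" + quote(mode) + ")"
    )

# Hypothetical list of (schema, table) pairs to refresh.
tables = [("analytics", "events"), ("analytics", "sessions")]
for schema, table in tables:
    print(sync_partition_call(schema, table))
```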

Microstrategy - HBase connection

We are trying to connect MS 9.4 to HBase via Impala connector.
First we created the Hive tables, linking them to HBase tables with the following create table statement (as we saw in the docs):
CREATE TABLE hiveTableName1
(key int, columnName1 codClient, columnName2 clientName)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamily1:columnName1,columnfamily1:columnName2")
TBLPROPERTIES ("hbase.table.name" = "hbaseTableName1");
We did this twice, since we want to create two Hive tables and their corresponding HBase tables, in order to perform a join between them later with MS.
For the connection between MS and HBase, we followed the steps by selecting the MicroStrategy ODBC Driver for Impala Wire Protocol and filling in the Data Source Name (Impala Data Source previously created with the Impala Driver), host and port (both for the Impala installation in our AWS infrastructure) and impala/impala for credentials.
The thing is that when we complete the wizard and select the default namespace (which is the only one available; no other namespace has been created), we can see the Hive tables that we created before, instead of the HBase tables.
I mean:
hiveTableName1
hiveTableName2
instead of
hbaseTableName1
hbaseTableName2
And, since these are the only tables available, we can only build our report with these two tables: a very easy join between these two tables on one field.
Both tables have 200,000 records and the join takes more than 1 minute to complete.
I'm sure that we are missing something here, and that the process of linking Hive tables to HBase ones is not completely right.
Is there a way to be able to connect to these two hbase tables instead of hive ones?
Any help will be really appreciated.
1. HBase does not support SQL and does not support the concept of "join" anyway.
2. Mapping Hive tables on HBase tables means that every Hive query triggers a full scan on HBase side, then the result is fed to a MapReduce batch job that does the filters and the joins.
Bottom line: 1 min is quite fast for what you are doing.
If you expect sub-second results, try some "small data" technologies (e.g. MySQL, Oracle, even MS Access) or forget about joins.
For sub-minute results, you might give Apache Phoenix a try: it's an HBase wrapper with indexes and some kind of SQL. Not sure about ODBC/JDBC drivers though.
