Using Apache Atlas with Hive, where is the metadata stored? In the Titan graph repository or in an RDBMS with Hive?

I have installed Atlas, Hive and Hadoop and configured them correctly.
But I want to know where the metadata is stored after importing it.
According to some Atlas docs, the metadata is stored in the Titan graph repository.
However, according to some Hive docs, the metadata is stored in an RDBMS such as MySQL.
If I install both Atlas and Hive, where specifically will the metadata be stored?

Though the existing answer is not wrong, I think it is worth pointing out that the asker seems to be mixing up two kinds of metadata.
Hive metadata: this is indeed stored in a relational DB, with MySQL being the default.
Atlas metadata: this is stored in HBase (in older versions the Titan graph repository was backed by HBase).
An example of Hive metadata is 'what are the columns of this table'.
An example of Atlas metadata is 'how did I tag column a'.
So if you install both Hive and Atlas, there will be two kinds of metadata, stored in the locations mentioned above.
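To make the distinction concrete, here is a minimal sketch (my own illustration, not from the question) that lists tables from a MySQL-backed Hive metastore and then asks Atlas for its hive_table entities over the REST API. The host names, ports and credentials (localhost, admin/admin, a metastore database called hive) are assumptions for a sandbox-style setup.

```python
# Contrasting the two metadata stores. Assumptions: Atlas runs on
# localhost:21000 with the default admin/admin credentials, and the Hive
# metastore uses MySQL on localhost with a database named "hive".
import requests
import pymysql

# 1) Hive metadata ("what tables/columns exist") lives in the RDBMS-backed
#    Hive metastore. Listing tables straight from the MySQL schema:
conn = pymysql.connect(host="localhost", user="hive", password="hive", database="hive")
with conn.cursor() as cur:
    cur.execute("SELECT d.NAME, t.TBL_NAME FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID")
    for db_name, tbl_name in cur.fetchall():
        print(f"Hive metastore knows about: {db_name}.{tbl_name}")

# 2) Atlas metadata (tags, classifications, lineage) lives in Atlas' own
#    graph store (Titan/JanusGraph backed by HBase) and is reached through
#    the Atlas REST API, not through the Hive metastore:
resp = requests.get(
    "http://localhost:21000/api/atlas/v2/search/basic",
    params={"typeName": "hive_table"},
    auth=("admin", "admin"),
)
for entity in resp.json().get("entities", []):
    print("Atlas entity:", entity.get("displayText"), entity.get("guid"))
```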

The metadata is stored in an HBase database, which is maintained by Apache Atlas itself.
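If you want to see this for yourself, a small sketch like the following (an illustration under assumptions, not part of the answer) lists the tables in HBase, where the Atlas-managed graph and entity-audit tables should show up; it assumes the HBase Thrift server is running on localhost:9090 and the happybase client is installed.

```python
# List HBase tables to spot the ones Atlas maintains (their exact names
# vary by Atlas/HDP version, e.g. a Titan/JanusGraph storage table and an
# entity-audit table). Assumption: HBase Thrift server on localhost:9090.
import happybase

connection = happybase.Connection(host="localhost", port=9090)
for table_name in connection.tables():
    print(table_name.decode())  # the Atlas graph and audit tables appear here
connection.close()
```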

Related

Apache Impala metadata - separate from Hive or is it the same?

Pretty new to the Hadoop ecosystem here, starting to learn things bit by bit.
I tried to find out about Apache Impala and its metadata store. The documentation says there is a daemon specifically for it and describes how metadata is gathered for each table. Other sources say that the Hive metastore is in fact also the Impala metadata store.
I can't find a single source of best practices or any user-guide-type document for handling metadata in Impala. Is that because the metadata store is in fact the Hive metastore?
I'm just generally confused by the lack of information out there.

Lineage is not visible for Hive Managed Table in HDP Atlas

I am using Atlas with HDP to create lineage for my Hive tables, but lineage is only visible for Hive external tables. I created Hive managed tables, performed a join operation to create a new table, and imported the Hive metastore using import-hive.sh, located under the hook-bin folder. But the lineage for the managed table is not visible.
Even the HDFS directory is not listed for the managed table, whereas for the external table the HDFS directory is available.
Can anyone help me here? Thanks in advance.
There were two factors causing the issue in my case: the first was the Hive hook, and the second was offsets.topic.replication.factor. To resolve this, the steps below were implemented:
1) Re-install the Hive hook for Atlas
Grep the list of services installed for Apache Atlas and re-install the Hive hook jar.
2) Kafka offset replication property
Change the offsets.topic.replication.factor value to 1.
After implementing the above changes, lineage is reflected in Atlas for Hive as well as Sqoop.
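A quick way to verify that the re-installed hook is actually publishing notifications is to watch the Atlas hook topic in Kafka. This is a diagnostic sketch under assumptions: the kafka-python client is installed, the broker is reachable on localhost:6667 (the usual HDP port), and ATLAS_HOOK is the default topic name. Running any Hive DDL/DML after this starts should make a JSON notification appear.

```python
# Tail the ATLAS_HOOK topic to confirm the Hive hook is emitting messages.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "ATLAS_HOOK",
    bootstrap_servers="localhost:6667",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop polling after 10s of silence
)
for message in consumer:
    print(message.value.decode("utf-8", errors="replace"))
```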

Can NiFi's SelectHiveQL read data from a table on a CDH cluster in Parquet format?

I have a use case where I have to move data from an in-house CDH cluster to an AWS EMR cluster.
I am thinking of setting up NiFi on an AWS EC2 instance to move the data from the in-house cluster to AWS S3 storage.
All my tables on the CDH cluster are stored in Parquet format.
Question #1:
Does NiFi support reading tables in Parquet format?
OR
Is my only option to read the data directly from the HDFS directory, place it on S3, and then create the Hive table in EMR?
Question #2: How does NiFi determine that new data has been inserted into a table and read only that new data? In my case all tables are partitioned by yyyymm.
If you use SelectHiveQL, it can read anything Hive can (including Parquet). All the conversion work is done in Hive and the result is returned through the JDBC driver as a ResultSet, so you'll get the data out as Avro or CSV depending on what you set as the Output Format property in SelectHiveQL.
Having said that, your CDH would need a Hive version of at least 1.2.1. I've seen quite a few questions about compatibility where CDH has Hive 1.1.x, which NiFi's Hive processors do not support. For that you'd need something like the Simba JDBC driver (not the Apache Hive JDBC driver, which doesn't implement all the necessary JDBC methods), and you can use ExecuteSQL and the other SQL processors with that driver.
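The same principle can be seen outside NiFi with a thin Hive client: Hive decodes the Parquet files server-side and hands back ordinary rows, so the client never touches Parquet at all. This is a sketch under assumptions (the pyhive package, a HiveServer2 reachable at cdh-host:10000, and a Parquet-backed table my_db.my_table are all placeholders for your environment).

```python
# Illustration: the client only sees plain rows, regardless of the
# on-disk format, because Hive does the Parquet reading server-side.
from pyhive import hive

conn = hive.connect(host="cdh-host", port=10000, username="etl_user")
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_db.my_table LIMIT 10")
for row in cursor.fetchall():
    print(row)  # plain Python tuples
```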

How do I update Apache Atlas metadata?

I have a Hortonworks Sandbox.
I have Atlas running, and all the databases, tables and columns from Hive are already there. I added a new table in Hive, but it didn't appear in Atlas automatically.
How do I update the Atlas metadata? Is there any good tutorial for Atlas that shows how to get started, e.g. how to import data from an existing cluster?
Regards,
Pawel
All metadata is reported to Atlas automatically; Hive should be running with the Atlas hook, which is responsible for this reporting.
If you have Hive installed as part of the Hortonworks platform, the hook should already be there; otherwise there are clear instructions in the Apache Atlas documentation on how to install the Hive hook (essentially an extra binary added to the Hive distribution).
In general the Apache Atlas documentation is well maintained and covers most cases.
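Once the hook is in place (or after a one-off run of the bundled import-hive.sh), one way to check whether the new table has reached Atlas is to look it up by its unique qualifiedName over the REST API. This is a sketch under assumptions: Atlas on localhost:21000 with the sandbox's default admin/admin credentials, a cluster name of "Sandbox", and a table default.my_new_table, all of which are placeholders.

```python
# Look up a Hive table entity in Atlas by its unique qualifiedName.
import requests

resp = requests.get(
    "http://localhost:21000/api/atlas/v2/entity/uniqueAttribute/type/hive_table",
    params={"attr:qualifiedName": "default.my_new_table@Sandbox"},
    auth=("admin", "admin"),
)
if resp.status_code == 200:
    entity = resp.json()["entity"]
    print("Found in Atlas:", entity["attributes"]["qualifiedName"])
else:
    print("Not in Atlas yet (HTTP %d); check the hook configuration" % resp.status_code)
```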

Questions about Hadoop, Hive and Presto

I am looking into using Hive on our Hadoop cluster and then using Presto to do some analytics on the data stored in Hadoop, but I am still confused about a few things:
Files are stored in Hadoop (some kind of file manager)
Hive needs tables to store data from Hadoop (data manager)
Do Hadoop and Hive store their data separately, or does Hive just use the files from Hadoop (in terms of hard disk space and so on)?
-> So does Hive import data from Hadoop into tables and leave Hadoop alone, or how should I see this?
Can Presto be used without Hive, directly on Hadoop?
Thanks in advance for answering my questions :)
First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call a data manager?
Hive can actually use both: "regular" files in HDFS, or tables, which are once again "regular" files (kept by default under Hive's warehouse directory) with additional metadata stored in a separate datastore, the Hive metastore.
Concerning Presto: it has built-in support for the Hive metastore, but you can also write your own connector plugin for any data source.
Please read more about Hive connector configuration here and about connector plugins here.
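To illustrate the layering, here is a small sketch of querying through Presto's hive connector: Presto consults the Hive metastore only for the table definitions and then reads the files straight from HDFS. The package name (presto-python-client), coordinator address, user, and table hive.default.events are assumptions for the example.

```python
# Query a Hive-metastore-backed table through Presto's hive catalog.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coord",
    port=8080,
    user="analyst",
    catalog="hive",    # the Hive connector, backed by the Hive metastore
    schema="default",
)
cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM events")
print(cursor.fetchone())
```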
