Lineage is not visible for Hive Managed Table in HDP Atlas - hadoop

I am using Atlas with HDP to build lineage for my Hive tables, but lineage is only visible for Hive external tables. I created Hive managed tables, performed a join to create a new table, and imported the Hive metastore using import-hive.sh under the hook-bin folder. However, the lineage for the managed table is not visible.
Even the HDFS directory is not listed for the managed table, whereas for the external table the HDFS directory is shown.
Can anyone help me here? Thanks in advance.

There were two factors causing the issue in my case: the first was the Hive hook, and the second was offsets.topic.replication.factor. To resolve it, the steps below were implemented:
1) Re-install the Hive hook for Atlas
List the services installed for Apache Atlas and re-install the Hive hook jar.
2) Kafka offset replication property
Change the offsets.topic.replication.factor value to 1.
After implementing the above changes, lineage shows up in Atlas for Hive as well as Sqoop; a rough sketch of both steps follows below.
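A command-level sketch of both steps on an HDP node; the paths and the Ambari navigation below are the usual HDP defaults and should be treated as assumptions for your install:

    # 1) Verify the Atlas Hive hook jars are in place, then re-run the metastore import
    ls /usr/hdp/current/atlas-server/hook/hive/             # hook jars (path assumed)
    /usr/hdp/current/atlas-server/hook-bin/import-hive.sh   # re-import Hive metadata into Atlas

    # 2) In the Kafka broker configuration (e.g. Ambari > Kafka > Configs), set
    #      offsets.topic.replication.factor=1
    #    then restart Kafka, Atlas and HiveServer2 so hook notifications flow again.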

Related

Using Apache Atlas with Hive, where is the metadata stored? In the Titan graph repository or in an RDBMS with Hive?

I have installed Atlas, Hive and Hadoop and configured them correctly.
But I want to know where the metadata is stored after importing it.
According to some Atlas docs, the metadata is stored in the Titan graph repository.
However, according to some Hive docs, the metadata is stored in an RDBMS such as MySQL.
If I install both Atlas and Hive, where specifically will the metadata be stored?
Though the existing answer is not wrong, I think it is worth pointing out that the asker seems to be mixing up two kinds of metadata.
Hive metadata: this is indeed stored in a relational database, typically MySQL.
Atlas metadata: this is stored in HBase (Titan was backed by HBase in older versions).
An example of Hive metadata is 'what are the columns of this table'.
An example of Atlas metadata is 'how did I tag column a'.
So if you install both Hive and Atlas, there will be two kinds of metadata, stored in the two places mentioned above.
The metadata is stored in an HBase database, and that HBase database is maintained by Apache Atlas itself.
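To make the split concrete, the two stores are pointed at by different configuration files; the property names below are the standard ones, and the host names are placeholders:

    # hive-site.xml -- Hive metadata (databases, tables, columns, partitions) in an RDBMS
    #   javax.jdo.option.ConnectionURL = jdbc:mysql://metastore-host/hive?createDatabaseIfNotExist=true

    # atlas-application.properties -- Atlas metadata (tags, classifications, lineage) in HBase
    #   atlas.graph.storage.backend  = hbase
    #   atlas.graph.storage.hostname = zk-host1,zk-host2,zk-host3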

Unable to retain HIVE tables

I have set up a single-node Hadoop cluster on Ubuntu and installed Hadoop version 2.6 on my machine.
Problem:
Every time I create Hive tables and load data into them, I can see the data by querying them, but once I shut down Hadoop, the tables get wiped out. Is there any way I can retain them, or is there a setting I am missing?
I tried some solutions suggested online, but nothing worked. Kindly help me out with this.
Thanks
B
The Hive table data lives in HDFS; Hive just adds metadata on top and gives users SQL-like commands so they do not have to write basic MapReduce jobs. So if you shut down the Hadoop cluster, Hive cannot find the data for its tables.
But if you are saying the data is lost even after you restart the Hadoop cluster, that is a different problem.
It seems you are using the default Derby database as the metastore. Configure a proper Hive metastore instead; the linked question "Hive is not showing tables" walks through it, and a minimal sketch follows below.
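A minimal sketch of the relevant hive-site.xml settings for a MySQL-backed metastore, written as name = value pairs for brevity (the real file is XML); the host, database, user and password are placeholders:

    javax.jdo.option.ConnectionURL        = jdbc:mysql://dbhost:3306/hive_metastore?createDatabaseIfNotExist=true
    javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
    javax.jdo.option.ConnectionUserName   = hiveuser
    javax.jdo.option.ConnectionPassword   = hivepass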

HCatalog/Hive table creation does not import data into /app/hive/warehouse folder in hadoop cluster

I ran into a very weird problem with a Hadoop cluster (HDP 2.2) I set up in Amazon EC2 (3 data nodes, one name node and one secondary name node). The Hue server runs on the main name node and the Hive server runs on the secondary name node. I was using the Hue web interface to create the table "mytable" in HCatalog from a CSV file loaded into HDFS. The table creation returned successfully without error, and the table was created and displayed in the Hue web interface. However, when I tried to query the table, it returned 0 records. When I went to the /app/hive/warehouse folder, I could see that the table folder "mytable" had been created, but the CSV file was never copied into it. I reproduced the same behavior using the hive shell.
If I do the same operation in the HDP sandbox VM, everything works as expected: after the table creation, the /app/hive/warehouse/mytable folder contains the CSV file I imported into the table.
Any help is highly appreciated.
I solved the issue. I realized the server in the cluster running the Hive server was low on physical memory. After freeing up some memory on the box, the HCatalog table creation worked as expected.
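For reference, the hive-shell equivalent of the create-and-load is roughly the following (an approximation of what the Hue wizard does; the column list and CSV path are hypothetical). After a successful load, the file should sit under the table's warehouse folder:

    hdfs dfs -put mydata.csv /user/admin/mydata.csv
    hive -e "CREATE TABLE mytable (col1 STRING, col2 STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
             LOAD DATA INPATH '/user/admin/mydata.csv' INTO TABLE mytable;"
    hdfs dfs -ls /app/hive/warehouse/mytable   # the loaded CSV should appear here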

Questions about Hadoop And Hive And Presto

I am looking into using Hive on our Hadoop cluster and then Presto to do some analytics on the data stored in Hadoop, but I am still confused about a few things:
Files are stored in Hadoop (some kind of file manager)
Hive needs tables to store data from Hadoop (data manager)
Do Hadoop and Hive store their data separately, or does Hive just use the files from Hadoop (in terms of hard disk space and so on)?
-> So does Hive import data from Hadoop into tables and leave Hadoop alone, or how should I see this?
Can Presto be used without Hive, directly on Hadoop?
Thanks in advance for answering my questions :)
First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call the data manager?
Actually, Hive can use both: "regular" files in HDFS, or tables, which are once again "regular" files plus additional metadata kept in a special datastore (the metastore), with managed-table files living under the Hive warehouse directory in HDFS.
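A small illustration of the two cases via the hive CLI; the table names and the HDFS path are hypothetical:

    # managed table: Hive owns the files, which end up under the warehouse directory
    hive -e "CREATE TABLE sales_managed (id INT, amount DOUBLE);"

    # external table: Hive only records metadata over files already sitting in HDFS;
    # dropping it removes the metadata but leaves /data/sales untouched
    hive -e "CREATE EXTERNAL TABLE sales_raw (id INT, amount DOUBLE)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LOCATION '/data/sales';"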
Concerning Presto: it has built-in support for the Hive metastore, but you can also write your own connector plugin for any data source.
Please read more about Hive connector configuration here and about connector plugins here.
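On the Presto side, the Hive connector is configured with a catalog properties file; the connector name below is the one used by older Presto releases, and the metastore host is a placeholder:

    # etc/catalog/hive.properties on every Presto node
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore-host:9083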

hadoop hive question

I'm trying to create tables programmatically using JDBC. However, I can't see the tables I created from the hive shell. What's worse, when I access the hive shell from different directories, I see different contents in the database.
Is there any setting I need to configure?
Thanks in advance.
Make sure you run hive from the same directory every time, because when you launch the hive CLI for the first time, it creates a Derby metastore database in the current directory. This Derby DB contains the metadata of your Hive tables, so if you change directories you end up with scattered metadata for your Hive tables. The Derby DB also cannot handle multiple sessions. To allow concurrent Hive access you need a real database to manage the metastore rather than the wimpy little Derby DB that comes with it. You can install MySQL for this and change the Hive properties to use a JDBC connection with the MySQL type 4 pure Java driver.
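The per-directory behaviour of the embedded Derby metastore is easy to reproduce; the directories below are hypothetical:

    cd /home/user/projectA
    hive -e "CREATE TABLE t1 (id INT);"   # creates ./metastore_db and ./derby.log here
    cd /home/user/projectB
    hive -e "SHOW TABLES;"                # a fresh ./metastore_db in this directory: t1 is not listed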
Try emailing the Hive user list or asking on the IRC channel.
You probably need to set up the central Hive metastore (by default Derby, but it can be MySQL/Oracle/Postgres). The metastore is the "glue" between Hive and HDFS: it tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, and so on.
For more information, see http://wiki.apache.org/hadoop/HiveDerbyServerMode
Examine your Hadoop logs. For me this happened when my Hadoop system was not set up properly: the namenode was not able to contact the datanodes on other machines, and so on.
Yeah, it's due to the metastore not being set up properly. The metastore stores the metadata associated with your Hive tables (e.g. the table name, table location, column names, column types, bucketing/sorting information, partitioning information, SerDe information, etc.).
The default metastore is an embedded Derby database which can only be used by one client at a time. That is obviously not good enough for most practical purposes. You, like most users, should configure your Hive installation to use a different metastore; MySQL seems to be a popular choice. I have used this link from Cloudera's website to successfully configure my MySQL metastore.
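Once hive-site.xml points at MySQL, the remaining one-time steps look roughly like this; the driver jar version and paths are placeholders, and the schematool utility only ships with newer Hive releases:

    cp mysql-connector-java-5.1.49.jar $HIVE_HOME/lib/   # make the MySQL JDBC driver visible to Hive
    $HIVE_HOME/bin/schematool -dbType mysql -initSchema  # create the metastore schema in MySQL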
