How do I update Apache Atlas metadata? - hadoop

I have a Hortonworks Sandbox.
I have started Atlas, and it already shows all databases, tables and columns from Hive. I have added a new table to Hive, but it didn't appear in Atlas automatically.
How do I update Atlas metadata? Is there a good getting-started tutorial for Atlas, e.g. one showing how to import data from an existing cluster?
Regards
Pawel

Hive metadata is reported to Atlas automatically, provided Hive is running with the Atlas hook, which is responsible for that reporting.
If Hive was installed as part of the Hortonworks platform, the hook should already be there; otherwise, the Apache Atlas documentation has clear instructions for installing the Hive hook (it is an extra binary added to the Hive distribution).
In general, the Apache Atlas documentation is well maintained and covers most cases.
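One way to confirm whether the hook actually reported a new table is to ask Atlas for it over its REST API. The following is only a minimal sketch: it assumes the Atlas v2 basic-search endpoint is reachable and uses HTTP basic authentication, and the helper names (build_search_url, qualified_names, find_table) are made up for this illustration.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_search_url(atlas_base, table_name):
    """Build a basic-search URL for hive_table entities matching table_name."""
    params = urllib.parse.urlencode({"typeName": "hive_table", "query": table_name})
    return f"{atlas_base.rstrip('/')}/api/atlas/v2/search/basic?{params}"


def qualified_names(search_result):
    """Extract the qualifiedName of every entity in a basic-search response."""
    entities = search_result.get("entities") or []
    return [e.get("attributes", {}).get("qualifiedName") for e in entities]


def find_table(atlas_base, table_name, user, password):
    """Query Atlas and return the qualified names of matching Hive tables."""
    request = urllib.request.Request(build_search_url(atlas_base, table_name))
    # Basic authentication is an assumption; adapt it to your cluster's setup.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return qualified_names(json.load(response))
```

For example, find_table("http://localhost:21000", "my_new_table", "admin", "admin") would query a local sandbox; the host, port, table name and credentials here are placeholders, not values taken from the question. An empty list suggests the hook did not report the table, in which case the import-hive.sh script shipped with Atlas can import the existing Hive metastore in bulk.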

Related

Kylin Additional Data Sources like SQL Server

I have a Kubernetes cluster with Kylin for Back-End and Superset as Front-End.
Everything works great for the example "Default" database within the Kylin application.
Now I am trying to add a SQL Server database, so I added the following to the $KYLIN_HOME/conf/kylin.properties file:
kylin.source.default=8
kylin.source.jdbc.connection-url=jdbc:sqlserver://hostname:1433;database=sample
kylin.source.jdbc.driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
kylin.source.jdbc.dialect=mssql
kylin.source.jdbc.user=your_username
kylin.source.jdbc.pass=your_password
kylin.source.jdbc.sqoop-home=/usr/hdp/current/sqoop-client
kylin.source.jdbc.filed-delimiter=|
As the documentation describes, I also added the SQL Server JDBC driver jar file to the $KYLIN_HOME/ext/ directory.
In addition, the documentation mentions installing Sqoop and adding the SQL Server JDBC driver jar file to the $SQOOP_HOME/lib/ directory as well.
But inside the container I do not have pip to install it, so should I create a new image with pip and Sqoop installed? Is this the right way? And what does Kylin need?
UPDATE
After some investigation, I managed to install pip as well in case I needed it, because I originally thought I should install pysqoop, which didn't work. The documentation suggests installing Apache Sqoop, and I am not sure what I should download and where to place the files.
Kylin has a document on Setup JDBC Data Source.
The Sqoop it refers to is Apache Sqoop, a bulk data transfer tool for Hadoop. Both Kylin and Sqoop are written in Java, so neither needs Python or pip.
I suggest investigating further in the Hadoop ecosystem. :-)

Lineage is not visible for Hive Managed Table in HDP Atlas

I am using Atlas with HDP to create the lineage flow for my Hive tables, but the lineage is only visible for Hive external tables. I created Hive managed tables, performed a join operation to create a new table, and imported the Hive metastore using import-hive.sh, which is located under the hook-bin folder. But the lineage for the managed table is not visible.
Even the HDFS directory is not listed for the managed table, whereas for the external table the HDFS directory is available.
Can anyone help me over here? Thanks in advance.
Two factors were causing the issue in my case: the first was the Hive hook and the second was offsets.topic.replication.factor. To resolve it, I took the steps below:
1) Re-install the Hive hook for Atlas
List the services installed for Apache Atlas (for example with grep) and re-install the Hive hook jar.
2) Kafka offset replication property
Change the offsets.topic.replication.factor value to 1.
After these changes, lineage shows up in Atlas for Hive as well as Sqoop.
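To verify that lineage really is being recorded after a fix like this, the Atlas lineage REST endpoint can be queried for a table's GUID. This is only a rough sketch: it assumes the v2 lineage endpoint and basic authentication, the helper names are invented for the example, and the GUID has to be looked up first (for instance in the Atlas UI or via a search call).

```python
import base64
import json
import urllib.request


def build_lineage_url(atlas_base, guid, depth=3):
    """Build the lineage URL for one entity, covering inputs and outputs."""
    base = atlas_base.rstrip("/")
    return f"{base}/api/atlas/v2/lineage/{guid}?direction=BOTH&depth={depth}"


def has_lineage(lineage_info):
    """True when the lineage response contains at least one relation (edge)."""
    return bool(lineage_info.get("relations"))


def fetch_lineage(atlas_base, guid, user, password):
    """Fetch the lineage document for the entity with the given GUID."""
    request = urllib.request.Request(build_lineage_url(atlas_base, guid))
    # Basic authentication is an assumption; adapt it to your cluster's setup.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

If has_lineage(fetch_lineage(...)) returns False for a managed table that was produced by a join, the hook messages are probably still not reaching Atlas.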

With Apache Atlas and Hive, where is the metadata stored? In the Titan graph repository or in an RDBMS with Hive?

I have installed Atlas, Hive and Hadoop and configured them correctly.
But I want to know where the metadata is stored after importing it.
According to some Atlas docs, the metadata is stored in the Titan graph repository.
However, according to some Hive docs, the metadata is stored in an RDBMS such as MySQL.
If I install both Atlas and Hive, where specifically will the metadata be stored?
Though the existing answer is not wrong, I think it is good to point out that the asker seems to be mixing up two kinds of metadata.
Hive metadata: This is indeed stored in a relational DB. MySQL is a common choice; plain Apache Hive defaults to an embedded Derby database.
Atlas metadata: This is stored in HBase (Titan was backed by HBase for older versions?)
An example of Hive metadata is 'what are the columns'
An example of Atlas metadata is 'how did I tag column a'
So if you install both Hive and Atlas, there will be two kinds of metadata, each stored in the place mentioned above.
The metadata is stored in an HBase database. The HBase database is maintained by Apache Atlas itself.

hcatUtil not found when Configuring HP Vertica for HCatalog

I am trying to configure HP Vertica for HCatalog:
Configuring HP Vertica for HCatalog
But I cannot find hcatUtil on my Vertica cluster.
Where can I get this utility?
As this answer said, it's in /opt/vertica/packages/hcat/tools starting with version 7.1.1. But you probably need some further information:
You need to run hcatUtil on a node in your Hadoop cluster; the utility gathers up Hadoop libraries that Vertica also needs to access, so you need to have those libraries available. Assuming you're not co-locating Vertica nodes on your Hadoop nodes, the easiest way to do this is probably to copy the script to a Hadoop node, run it with output to a temporary directory, and then copy the contents of the temporary directory back to the Vertica node. (Put them in /opt/vertica/packages/hcat/lib.) Then proceed with installing the HCatalog connector.
See this section in the Vertica documentation for more details. (Link is to 7.2.x, but the process has been the same since the tool was introduced.)
The hcatUtil utility was introduced in Vertica 7.1.1 and is located at /opt/vertica/packages/hcat/tools. If you do not have it there, you are most likely using an older Vertica version.

Cassandra integration with Hadoop

I am a newbie to Cassandra. I am posting this question because different documentation gives different details about integrating Hive with Cassandra, and I was not able to find the GitHub page.
I have installed a single node Cassandra 2.0.2 (Datastax Community Edition) in one of the data nodes of my 3 node HDP 2.0 cluster.
I am unable to use Hive to access Cassandra using 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'. I am getting the error 'return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
I have copied all the jars in /$cassandra_home/lib/* to /$hive-home/lib and also included /cassandra_home/lib/* in $HADOOP_CLASSPATH.
Are there any other configuration changes that I have to make to integrate Cassandra with Hadoop/Hive?
Please let me know. Thanks for the help!
Thanks,
Arun
These are probably good starting points for you:
Hive support for Cassandra, github
Top level article related to your topic with general information: Hive support for Cassandra CQL3.
Hadoop support, Cassandra Wiki.
Actually, your question is fairly broad; there could be a lot of reasons for this error. But one thing to remember is that Hive is based on the MapReduce engine.
Hope this helps.
