Kylin Additional Data Sources like SQL Server - pip

I have a Kubernetes cluster with Kylin as the back end and Superset as the front end.
Everything works great for the example "Default" database within the Kylin application.
Now I am trying to add a SQL Server database, for which I have added the following configuration to the $KYLIN_HOME/conf/kylin.properties file:
kylin.source.default=8
kylin.source.jdbc.connection-url=jdbc:sqlserver://hostname:1433;database=sample
kylin.source.jdbc.driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
kylin.source.jdbc.dialect=mssql
kylin.source.jdbc.user=your_username
kylin.source.jdbc.pass=your_password
kylin.source.jdbc.sqoop-home=/usr/hdp/current/sqoop-client
kylin.source.jdbc.filed-delimiter=|
As the documentation describes, I also added the SQL Server JDBC driver jar file to the $KYLIN_HOME/ext/ directory.
In addition, the documentation mentions installing Sqoop and adding the same SQL Server JDBC driver jar to the $SQOOP_HOME/lib/ directory.
But inside the container I do not have pip to install it, so should I create a new image with pip and Sqoop installed? Is this the right way, and what exactly does Kylin need?
UPDATE
After some investigation, I managed to install pip in case I needed it, because I originally thought I should install pysqoop, which did not work. The documentation suggests installing Apache Sqoop, but I am not sure what I should download and where to place the files.

Kylin has a document on Setup JDBC Data Source.
The sqoop here is Apache Sqoop, a bulk data transfer tool for Hadoop. Both Kylin and Sqoop are written in Java, so neither needs Python or pip.
I suggest investigating further in the Hadoop world. :-)
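For illustration, a rough sketch of what the Sqoop setup might look like inside the image (the Sqoop version, mirror URL, and install directory are assumptions; adjust them and make sure kylin.source.jdbc.sqoop-home points at the resulting installation):

# Install Apache Sqoop 1.4.x from the binary tarball; no pip is involved.
SQOOP_VERSION=1.4.7
wget https://archive.apache.org/dist/sqoop/${SQOOP_VERSION}/sqoop-${SQOOP_VERSION}.bin__hadoop-2.6.0.tar.gz
tar -xzf sqoop-${SQOOP_VERSION}.bin__hadoop-2.6.0.tar.gz -C /opt
export SQOOP_HOME=/opt/sqoop-${SQOOP_VERSION}.bin__hadoop-2.6.0

# The same SQL Server JDBC driver jar goes into both places the documentation mentions.
cp mssql-jdbc-*.jar $KYLIN_HOME/ext/
cp mssql-jdbc-*.jar $SQOOP_HOME/lib/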

Related

How to connect to an Oracle database using Databricks

I'm trying to connect to an Oracle database using PySpark in a Databricks notebook. I cannot find any documentation on installing the library for the driver on the cluster.
Many thanks in advance.
If it is an interactive cluster, I'd use Maven for the installation. You can specify the coordinates or search for the package you want to install using the UI.
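If you prefer to script it, a hedged sketch using the legacy Databricks CLI (the cluster id and driver version are placeholders, and the flag names should be verified against your CLI version); the same Maven coordinates can be entered in the cluster's Libraries UI:

# Install the Oracle JDBC driver on an interactive cluster as a Maven library.
databricks libraries install \
  --cluster-id 1234-567890-abcde123 \
  --maven-coordinates com.oracle.database.jdbc:ojdbc8:19.8.0.0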

How do I update Apache Atlas metadata?

I have a Hortonworks Sandbox.
I have an Atlas application running. All the databases, tables and columns from Hive are already there. I have added a new table to Hive, but it didn't appear in Atlas automatically.
How do I update Atlas metadata? Is there any good tutorial for Atlas showing how to get started, e.g. how to import data from an existing cluster?
Regards
Pawel
All metadata is reported to Atlas automatically. Hive should be running with the Atlas hook, which is responsible for such reporting.
If you have Hive installed as part of the Hortonworks platform, the hook should already be there; otherwise the Apache Atlas documentation has clear instructions on how to install the Hive hook (an extra binary added to the Hive distribution).
In general the Apache Atlas documentation is well maintained and covers most cases.
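As a hedged sketch, two things worth checking on a Hortonworks-style install (paths are typical HDP locations and may differ in your sandbox version):

# One-off import of existing Hive metadata into Atlas via the Hive bridge script shipped with Atlas.
/usr/hdp/current/atlas-server/hook-bin/import-hive.sh

# Verify the Atlas Hive hook is registered, so that new tables get reported automatically.
grep -A1 'hive.exec.post.hooks' /etc/hive/conf/hive-site.xml
# expected to include: org.apache.atlas.hive.hook.HiveHook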

Apache Kylin installation without Sandbox

I was wondering if there are any resources regarding Apache Kylin installation without any sandbox (like Cloudera or Hortonworks) support. I have managed to do the following:
Install Hadoop 2.6
Install Hive
Install HBase
Then I used the binary from the Kylin site and so far have been able to run it. The problem starts when I try to build a cube: the MapReduce job gets stuck at step 2. I am wondering whether it still assumes it is in sandbox mode and is not submitting the job to Hadoop at all (there is no entry in the Hadoop JobTracker).
So I need a solution for two things:
1. Possible configuration of Kylin in a pure Hadoop setup (no sandbox)
2. Somehow enabling the Kylin setup to submit jobs to Hadoop
There is no such sandbox or non-sandbox configuration in Kylin. Just make sure the machine where Kylin runs has Hadoop set up correctly and you should be fine.
Under the hood, kylin.sh uses hbase classpath and hive -e set | grep 'env:CLASSPATH' to detect the Hadoop settings. Double-check that these commands work as expected if you are not sure which cluster Kylin connects to.
If Kylin has problems submitting MR jobs, check two places. First, the Hadoop resource manager: see whether the job has really been submitted; sometimes it is just running slowly. Second, kylin.log: see if there is any exception there. Post the log to the Kylin dev mailing list and someone will be able to help.
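For example, a quick sketch of those checks (the YARN command and log path assume a default Kylin/YARN setup):

# What kylin.sh picks up from HBase and Hive; run these as the user that starts Kylin.
hbase classpath
hive -e set | grep 'env:CLASSPATH'

# Did the cube-build job actually reach the cluster?
yarn application -list -appStates RUNNING

# Any exception during submission shows up in Kylin's own log.
tail -n 200 $KYLIN_HOME/logs/kylin.log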
You can install hadoop-2.6, hive-0.14 and hbase-0.98.8-hadoop2 with the built-in ZooKeeper, or with an external zookeeper-3.5.
Now you can run kylin-v1.1-release on it.
If you still face issues, paste the log here.

hcatUtil not found when Configuring HP Vertica for HCatalog

I am trying to configure HP Vertica for HCatalog:
Configuring HP Vertica for HCatalog
But I cannot find hcatUtil on my Vertica cluster.
Where can I get this utility?
As this answer said, it's in /opt/vertica/packages/hcat/tools starting with version 7.1.1. But you probably need some further information:
You need to run hcatUtil on a node in your Hadoop cluster; the utility gathers up Hadoop libraries that Vertica also needs to access, so you need to have those libraries available. Assuming you're not co-locating Vertica nodes on your Hadoop nodes, the easiest way to do this is probably to copy the script to a Hadoop node, run it with output to a temporary directory, and then copy the contents of the temporary directory back to the Vertica node. (Put them in /opt/vertica/packages/hcat/lib.) Then proceed with installing the HCatalog connector.
See this section in the Vertica documentation for more details. (Link is to 7.2.x, but the process has been the same since the tool was introduced.)
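A rough sketch of that copy-back workflow (the hostname, temporary directory, and hcatUtil options are placeholders; take the exact options from the Vertica documentation for your version):

# Copy the utility to a Hadoop node and run it there, where the Hadoop/Hive libraries live.
scp /opt/vertica/packages/hcat/tools/hcatUtil hadoop-node:/tmp/
ssh hadoop-node 'mkdir -p /tmp/hcat-libs && /tmp/hcatUtil --copyJars --hcatLibPath=/tmp/hcat-libs'

# Pull the gathered libraries back to the Vertica node, then install the HCatalog connector.
scp -r 'hadoop-node:/tmp/hcat-libs/*' /opt/vertica/packages/hcat/lib/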
The hcatUtil utility was introduced in Vertica 7.1.1 and is located at /opt/vertica/packages/hcat/tools. If you do not have it there, you are most likely using an older Vertica version.

Cascading HBase Tap

I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between the Hadoop/HBase version I am using and the one Twitter used as a client.
My cluster is running Cloudera CDH4 with HBase 0.92 and Hadoop 2.0.0-cdh4.1.3. Whenever I launch a Scalding job connecting to HBase, I get the exception
java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream;
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:363)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1046)
...
It seems that the HBase client used by Twitter Maple is expecting some method on NetUtils that does not exist on the version of Hadoop deployed on my cluster.
How do I track down what exactly the mismatch is, i.e. which version the HBase client expects, and so on? Is there, in general, a way to mitigate these issues?
It seems to me that client libraries are often compiled against hardcoded versions of the Hadoop dependencies, and it is hard to make those match the versions actually deployed.
The method actually exists but has changed its signature. Basically, it boils down to having different versions of Hadoop libraries on your client and server. If your server is running Cloudera, you should be using the HBase and Hadoop libraries from Cloudera. If you're using Maven, you can use Cloudera's Maven repository.
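One way to see the mismatch for yourself (a sketch; the jar paths are placeholders for your local dependency cache and the cluster's Hadoop install) is to compare the method signatures with javap:

# Signature the HBase client was compiled against (Hadoop 1.x: returns java.io.InputStream) ...
javap -classpath ~/.ivy2/cache/org.apache.hadoop/hadoop-core/jars/hadoop-core-1.0.4.jar \
    org.apache.hadoop.net.NetUtils | grep getInputStream

# ... versus what is actually on the CDH4 cluster (the return type changed in HADOOP-8350).
javap -classpath /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.1.3.jar \
    org.apache.hadoop.net.NetUtils | grep getInputStream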
It seems like library dependencies are handled in Build.scala. I haven't used Scala yet, so I'm not entirely sure how to fix it there.
The change that broke compatibility was committed as part of HADOOP-8350. Take a look at Ted Yu's comments and the responses. He works on HBase and had the same issue. Later versions of the HBase libraries should automatically handle this issue, according to his comment.
