Apache Zeppelin configuration for connecting to Hive on HDP VirtualBox sandbox - jdbc

I've been struggling with the Apache Zeppelin notebook version 0.10.0 setup for a while.
The idea is to be able to connect it to a remote Hortonworks HDP 2.6.5 sandbox that runs locally in VirtualBox on Ubuntu 20.04.
I am using an image downloaded from:
https://www.cloudera.com/downloads/hortonworks-sandbox.html
Of course, the image has Zeppelin pre-installed and it works fine on port 9995, but this is the old 0.7.3 version, which doesn't support the Helium plugins I would like to use. I know that HDP 3.0.1 ships with the updated Zeppelin 0.8, but I cannot use it at the moment due to my hardware resources. Additionally, from what I remember, there was also a problem enabling the Leaflet Map plugin there.
My first thought was to update the notebook on the server, but after updating according to the instructions on the Cloudera forums (which are unfortunately unavailable at the moment, so I cannot provide a link or check for another solution), Zeppelin failed to start correctly.
A simpler solution now seemed to be connecting the newer notebook version to the virtual server, but despite many attempts with various configurations from threads here, I was not able to connect to Hive via JDBC. I also use this Zeppelin with a local Spark 3.0.3, but I have some geodata in Hive that I would like to visualize this way.
I used, among others, the description on the Zeppelin website:
https://zeppelin.apache.org/docs/latest/interpreter/jdbc.html#apache-hive
This is my current JDBC interpreter configuration:
hive.driver org.apache.hive.jdbc.HiveDriver
hive.url jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
hive.user hive
Artifact org.apache.hive:hive-jdbc:3.1.2
Depending on the driver version, there were different errors, but this time after typing:
%jdbc(hive)
SELECT * FROM mydb.mytable;
I get the following error:
Could not open client transport for any of the Server URI's in
ZooKeeper: Could not establish connection to
jdbc:hive2://sandbox-hdp.hortonworks.com:10000/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;hive.server2.proxy.user=hive;?tez.application.tags=paragraph_1645270946147_194101954;mapreduce.job.tags=paragraph_1645270946147_194101954;:
Required field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{set:hiveconf:mapreduce.job.tags=paragraph_1645270946147_194101954,
set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000,
hive.server2.proxy.user=hive, use:database=default,
set:hiveconf:tez.application.tags=paragraph_1645270946147_194101954})
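For reference, one way to isolate whether this is a Zeppelin issue or a driver/HiveServer2 version mismatch is to test the same connection from inside the sandbox with beeline, which ships with the matching Hive version (a sketch; the URLs assume the stock HDP 2.6.5 image):
# log in to the sandbox (port 2222 is the default VirtualBox port forward)
ssh root@127.0.0.1 -p 2222
# try ZooKeeper service discovery first, then a direct HiveServer2 connection
beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive
beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default" -n hive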
I will be very grateful to everyone for any help. Regards.

So, after many hours and trials, here's a working solution. First of all, the most important thing is to use drivers that match your version of Hadoop. You need jar files such as 'hive-jdbc-standalone' and 'hadoop-common' in their respective versions, and to avoid adding all of them in the 'Artifact' field of the %jdbc interpreter in Zeppelin, it is best to use a single uber jar containing all the required dependencies.
Thanks to Tim Veil, such a jar is available in his GitHub repository:
https://github.com/timveil/hive-jdbc-uber-jar/
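For example, you could place the uber jar somewhere Zeppelin can read it and point the interpreter at that path. A rough sketch (the release tag is a placeholder; check the repository's Releases page for the asset that matches HDP 2.6.5):
# download the uber jar built for HDP 2.6.5 (verify the exact release tag and file name on GitHub)
mkdir -p /opt/zeppelin/interpreter/jdbc
cd /opt/zeppelin/interpreter/jdbc
wget https://github.com/timveil/hive-jdbc-uber-jar/releases/download/<release-tag>/hive-jdbc-uber-2.6.5.0-292.jar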
These are my complete Zeppelin %jdbc interpreter settings:
default.url jdbc:postgresql://localhost:5432/
default.user gpadmin
default.password
default.driver org.postgresql.Driver
default.completer.ttlInSeconds 120
default.completer.schemaFilters
default.precode
default.statementPrecode
common.max_count 1000
zeppelin.jdbc.auth.type SIMPLE
zeppelin.jdbc.auth.kerberos.proxy.enable false
zeppelin.jdbc.concurrent.use true
zeppelin.jdbc.concurrent.max_connection 10
zeppelin.jdbc.keytab.location
zeppelin.jdbc.principal
zeppelin.jdbc.interpolation false
zeppelin.jdbc.maxConnLifetime -1
zeppelin.jdbc.maxRows 1000
zeppelin.jdbc.hive.timeout.threshold 60000
zeppelin.jdbc.hive.monitor.query_interval 1000
hive.driver org.apache.hive.jdbc.HiveDriver
hive.password
hive.proxy.user.property hive.server2.proxy.user
hive.splitQueries true
hive.url jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default
hive.user hive
Dependencies
Artifact /opt/zeppelin/interpreter/jdbc/hive-jdbc-uber-2.6.5.0-292.jar
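Once the remaining steps below are done and the interpreter is restarted, the query from the question can be run in a notebook paragraph to verify the connection:
%jdbc(hive)
SELECT * FROM mydb.mytable;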
The next step is to go to Ambari at http://localhost:8080/ and log in as admin. To do that, you must first log in to the sandbox root account via SSH:
ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password: hadoop
After a successful login you will be prompted to change the password immediately; please do that, and then set the Ambari admin password with the command:
[root@sandbox-hdp ~]# ambari-admin-password-reset
After that you can use the admin account in Ambari (log in and click the Hive link in the left panel):
Ambari -> Hive -> Configs -> Advanced -> Custom hive-site
Click Add Property
Insert the following into the window that opens:
hive.security.authorization.sqlstd.confwhitelist.append=tez.application.tags
After saving, restart all Hive services in Ambari. Everything should be working now, provided you set the proper Java path in 'zeppelin-env.sh' and the port in 'zeppelin-site.xml' (you must copy and rename 'zeppelin-env.sh.template' and 'zeppelin-site.xml.template' in Zeppelin's 'conf' directory; please remember that Ambari also uses port 8080!). A minimal sketch of that step is below.
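Assuming Zeppelin is unpacked under /opt/zeppelin and an OpenJDK 8 install path (both are assumptions; adjust to your environment):
cd /opt/zeppelin/conf
cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
# in zeppelin-env.sh set, for example:
#   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# in zeppelin-site.xml change zeppelin.server.port from 8080 to a free port (e.g. 8085),
# because the sandbox already forwards Ambari on 8080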
In my case, the only thing left to do was to add or uncomment the fragment responsible for the Helium plugin repository (in 'zeppelin-site.xml'):
<property>
<name>zeppelin.helium.registry</name>
<value>helium,https://s3.amazonaws.com/helium-package/helium.json</value>
<description>Enable helium packages</description>
</property>
Now you can go to the Helium tab in the top-right corner of the Zeppelin notebook and install the plugins of your choice; in my case it was the 'zeppelin-leaflet' visualization. And voilà! Here is a sample visualization in Hive from this Kaggle dataset:
https://www.kaggle.com/kartik2112/fraud-detection
Have a nice day!

Related

Confluent Kafka Connect Elasticsearch connector installation

I'm trying to install the Elasticsearch connector for Confluent Kafka Connect. I'm following the instructions below:
https://docs.confluent.io/kafka-connect-elasticsearch/current/index.html#install-the-connector-using-c-hub
after executing:
confluent-hub install confluentinc/kafka-connect-elasticsearch:latest
everything seems fine. See the result below:
[ec2-user@ip-172-31-16-76 confluent-6.1.0]$ confluent-hub install confluentinc/kafka-connect-elasticsearch:latest
The component can be installed in any of the following Confluent Platform installations:
1. /home/ec2-user/confluent-6.1.0 (based on $CONFLUENT_HOME)
2. /home/ec2-user/confluent-6.1.0 (found in the current directory)
3. /home/ec2-user/confluent-6.1.0 (where this tool is installed)
Choose one of these to continue the installation (1-3): 2
Do you want to install this into /home/ec2-user/confluent-6.1.0/share/confluent-hub-components? (yN) y
Component's license:
Confluent Community License
http://www.confluent.io/confluent-community-license
I agree to the software license agreement (yN) y
Downloading component Kafka Connect Elasticsearch 11.0.3, provided by Confluent, Inc. from Confluent Hub and installing into /home/ec2-user/confluent-6.1.0/share/confluent-hub-components
Do you want to uninstall existing version 11.0.3? (yN) y
Detected Worker's configs:
1. Standard: /home/ec2-user/confluent-6.1.0/etc/kafka/connect-distributed.properties
2. Standard: /home/ec2-user/confluent-6.1.0/etc/kafka/connect-standalone.properties
3. Standard: /home/ec2-user/confluent-6.1.0/etc/schema-registry/connect-avro-distributed.properties
4. Standard: /home/ec2-user/confluent-6.1.0/etc/schema-registry/connect-avro-standalone.properties
5. Based on CONFLUENT_CURRENT: /tmp/confluent.424339/connect/connect.properties
6. Used by Connect process with PID : /tmp/confluent.424339/connect/connect.properties
Do you want to update all detected configs? (yN) y
Adding installation directory to plugin path in the following files:
/home/ec2-user/confluent-6.1.0/etc/kafka/connect-distributed.properties
/home/ec2-user/confluent-6.1.0/etc/kafka/connect-standalone.properties
/home/ec2-user/confluent-6.1.0/etc/schema-registry/connect-avro-distributed.properties
/home/ec2-user/confluent-6.1.0/etc/schema-registry/connect-avro-standalone.properties
/tmp/confluent.424339/connect/connect.properties
/tmp/confluent.424339/connect/connect.properties
Completed
However, when I try to list all available connectors, I get the list below:
[ec2-user@ip-172-31-16-76 confluent-6.1.0]$ confluent local services connect connector list
The local commands are intended for a single-node development environment only,
NOT for production usage. https://docs.confluent.io/current/cli/index.html
Bundled Connectors:
file-sink
file-source
replicator
As per the instructions in the link above, I would expect to see elasticsearch-sink. Unfortunately, no such entry is available.
It seems I'm missing something simple, but I don't see any explanation in the instructions. Any help would be appreciated.
EDIT 1
Below you can see the result of curl -s localhost:8083/connector-plugins:
[
{"class":"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector","type":"sink","version":"11.0.3"},
{"class":"io.confluent.connect.replicator.ReplicatorSourceConnector","type":"source","version":"6.1.0"},
{"class":"io.confluent.kafka.connect.datagen.DatagenConnector","type":"source","version":"null"},
{"class":"org.apache.kafka.connect.file.FileStreamSinkConnector","type":"sink","version":"6.1.0-ce"},
{"class":"org.apache.kafka.connect.file.FileStreamSourceConnector","type":"source","version":"6.1.0-ce"},
{"class":"org.apache.kafka.connect.mirror.MirrorCheckpointConnector","type":"source","version":"1"},
{"class":"org.apache.kafka.connect.mirror.MirrorHeartbeatConnector","type":"source","version":"1"},
{"class":"org.apache.kafka.connect.mirror.MirrorSourceConnector","type":"source","version":"1"}
]
curl -s localhost:8083/connector-plugins gives the definitive answer from the worker as to which plugins are installed.
Per the output in your question, the Elasticsearch sink connector is now installed on your Connect worker. I don't know why the Confluent CLI would not show this.
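Since the worker's REST API reports the plugin, you can create an instance of it directly against the worker; the connector name, topic, and Elasticsearch URL below are placeholders for illustration:
curl -s -X POST localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "es-sink-test",
        "config": {
          "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
          "topics": "my-topic",
          "connection.url": "http://localhost:9200",
          "key.ignore": "true",
          "schema.ignore": "true"
        }
      }'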

How to safely fix an AWOL ambari system user?

I'm a student working on a test cluster consisting of around 25 hosts. We installed using Ambari and have FreeIPA running on one host as a DNS and LDAP server. The rest are typical Hadoop infrastructure. Hive was failing, and I wondered whether the DB connection parameters used during the Ambari installation were incorrect, so I tried to find a way to re-run the DB connection setup. I didn't get anywhere, and since it was late I left it, with the Ambari interface still working.
The next morning, the Ambari web UI seems to be down. I thought that maybe the web server needed to be restarted, so I tried the following:
[akidd@dw ~]$ sudo ambari-server start
Using python /usr/bin/python
Starting ambari-server
ERROR: Exiting with exit code 1.
REASON: Unable to detect a system user for Ambari Server.
- If this is a new setup, then run the "ambari-server setup" command to create the user
- If this is an upgrade of an existing setup, run the "ambari-server upgrade" command.
Refer to the Ambari documentation for more information on setup and upgrade.
Can anyone help me to understand what could have happened?
If I run ambari-server setup, will the existing cluster be OK, assuming I set everything up like for like with how it was originally?
Thanks for your help!
@user3535074 You should try to start it with the user that installed it.
If you do run ambari-server setup as the current user, remember to answer No to the following options:
Customize user account for ambari-server daemon [y/n] (n)? n
Do you want to change Oracle JDK [y/n] (n)? n
Enter advanced database configuration [y/n] (n)? n
More info in the following post, including how to back up the Ambari database before running setup again:
https://community.cloudera.com/t5/Support-Questions/Ambari-server-failed-to-start-after-system-reboot-Below-is/td-p/203806
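If you do end up re-running setup, a cautious sequence might look like this (this assumes the default embedded PostgreSQL database named 'ambari'; adjust if you use an external MySQL/Oracle database):
# back up the Ambari database first
sudo -u postgres pg_dump ambari > /tmp/ambari-db-backup.sql
# then re-run setup, answering No to the three prompts above, and start the server
sudo ambari-server setup
sudo ambari-server start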

"Select Version" is empty after successfully installed and setup Apache Ambari from source

I managed to build, install, and set up Apache Ambari successfully on my Debian 10 machine.
I am able to launch the Ambari admin UI and see the wizard.
However, in the second step of the wizard, Select Version, I see no choices to select.
What am I supposed to do here?
Thanks!
@wxh You need to install a stack repository in order to provide Ambari with components and services to install, for example HDP 2, HDP 3, or HDF.
Debian 10 repositories are not built, but you can try the Debian 9 ones:
https://docs.cloudera.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-installation/content/hdp_31_repositories.html
If you are installing from source, you may not want to use HDP/HDF, which opens up the huge can of worms of making your own stack.

Cloudera install agent issue

I have already installed CM6 and want to install the Cloudera Manager agent from a custom repository, and CDH6 using packages.
(I work with only one host.)
I have the files for the Cloudera Manager agent in the directory /cloudera/cloudera-repo/cm6/6.0.1 and for CDH6 in the directory /cloudera/cloudera-repo/cdh6/6.0.1
My steps for Cloudera Manager Agent:
Custom repository -> choose http://ip_addr/cloudera/cloudera-repo/cm6/6.0.1
For CDH and other software:
Install Method -> Use Packages
CDH Version -> CDH6
CDH Minor Version -> choose http://ip_addr/cloudera/cloudera-repo/cdh6/6.0.1
And on the Install Agents page I get this error:
Failed to copy installation files
/tmp/scm_prepare_node.xpsM8dvM
Connection refused (Connection refused)
I get the same error even when I specify empty directories. Why?
From the error, it seems that you have not provided the proper credentials to connect to your host; the SSH credentials seem to be incorrect. If you are sure the SSH credentials are fine, then it is a firewall issue: you need to make sure all the required ports are open and that nothing blocks Cloudera from installing the agent. A few quick checks are sketched below.
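From the Cloudera Manager host, for example (standard tools only; replace <agent-host> with your host's address):
ssh root@<agent-host>            # verify the SSH credentials you entered in the wizard
nc -zv <agent-host> 22           # the installer needs SSH reachable to copy files
systemctl status firewalld       # check whether a host firewall is interfering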

How to add DataNode to Cloudera hadoop

I am trying to add a DataNode to my existing single-DataNode cluster. Since my Unix server does not have access to the Internet, Cloudera Manager is unable to perform the installation and throws the error below. Is there any other CLI method to add a DataNode instead of CM?
BEGIN yum info jdk
Loaded plugins: product-id, subscription-manager
Updating Red Hat repositories.
http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/4.7.2/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'archive.cloudera.com'"
Yes, there are two approaches (I'm keeping this Cloudera-specific since this is what you mentioned).
1) Download tarballs and install everything manually. There is a guide available here, and I think it is not a good candidate to be copied here on Stack Overflow, because it is very long and vendor-specific (if the document moves, the title is "Installation Path C - Installation Using Tarballs").
2) For large internal installations, you may consider setting up your own repositories with rpm packages that you can access with yum. For this you'll need to edit /etc/yum.repos.d and point to some accessible host that's going to be your repo server (of course, you'll have to put your files there in advance). More details here. You can download rpms here. I have never had to do this myself, so I hope this will point you in the right direction. A sketch of such a repo definition follows below.
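As an illustration of option 2, a local repo file might look like this (the baseurl is a placeholder for your internal repo server, not a real Cloudera URL):
# create /etc/yum.repos.d/cloudera-manager-local.repo on each node
cat > /etc/yum.repos.d/cloudera-manager-local.repo <<'EOF'
[cloudera-manager-local]
name=Cloudera Manager (local mirror)
baseurl=http://your-repo-server/cm4/redhat/6/x86_64/cm/4.7.2/
enabled=1
gpgcheck=0
EOF
yum clean all && yum repolist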
