How to add a DataNode to Cloudera Hadoop

I am trying to add a DataNode to my existing single-DataNode cluster. Since my Unix server does not have access to the Internet, Cloudera Manager is unable to perform the installation and throws the error below. Is there any other CLI method to add a DataNode instead of using CM?
BEGIN yum info jdk
Loaded plugins: product-id, subscription-manager
Updating Red Hat repositories.
http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/4.7.2/repodata/repomd.xml: [Errno 14] PYCURL ERROR 6 - "Couldn't resolve host 'archive.cloudera.com'"

Yes, there are two approaches (I'm keeping this Cloudera-specific since this is what you mentioned).
1) Download the tarballs and install everything manually. There is a guide available here; I don't think it is a good candidate to be copied here on Stack Overflow because it is very long and vendor-specific (if the document moves, the title is "Installation Path C - Installation Using Tarballs").
2) For large internal installations, you may consider setting up your own repository of rpm packages that you can access with yum. For this you'll need to add a repo definition under /etc/yum.repos.d that points to some accessible host acting as your repo server (of course, you'll have to put your files there in advance). More details here. You can download the rpms here. I have never had to do this myself, so I hope this points you in the right direction; a minimal sketch of such a repo definition follows below.
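To illustrate option 2, here is a rough sketch of the client-side repo definition, assuming you have mirrored the Cloudera Manager rpms onto an internal host reachable as repo.internal.example (the hostname and path are placeholders, not anything Cloudera prescribes):
# /etc/yum.repos.d/cloudera-manager-local.repo
[cloudera-manager-local]
name=Cloudera Manager 4.7.2 (internal mirror)
baseurl=http://repo.internal.example/cm4/redhat/6/x86_64/cm/4.7.2/
gpgcheck=0
enabled=1
On the mirror host itself you would typically run createrepo over the directory of downloaded rpms so that yum can read the repository metadata.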

Related

Apache Zeppelin configuration for connect to Hive on HDP Virtualbox

I've been struggling with the Apache Zeppelin notebook version 0.10.0 setup for a while.
The idea is to be able to connect it to a remote Hortonworks 2.6.5 server that runs locally on Virtualbox in Ubuntu 20.04.
I am using an image downloaded from:
https://www.cloudera.com/downloads/hortonworks-sandbox.html
Of course, the image has Zeppelin pre-installed, and it works fine on port 9995, but it is the old 0.7.3 version, which doesn't support the Helium plugins I would like to use. I know that HDP 3.0.1 ships the updated Zeppelin 0.8, but using it is impossible at the moment due to my hardware resources. Additionally, from what I remember, there was also a problem enabling the Leaflet Map plugin there.
My first thought was to update the notebook on the server, but after updating according to the instructions on the Cloudera forums (unfortunately they are not available at the moment, so I cannot provide a link or look for another solution there), it failed to start correctly.
A simpler solution now seemed to be connecting a newer notebook version to the virtual server; unfortunately, despite many attempts and solutions from threads here with various configurations, I was not able to connect to Hive via JDBC. I also use Zeppelin with a local Spark 3.0.3, but I have some geodata in Hive that I would like to visualize this way.
I used, among others, the description on the Zeppelin website:
https://zeppelin.apache.org/docs/latest/interpreter/jdbc.html#apache-hive
This is my current JDBC interpreter configuration:
hive.driver org.apache.hive.jdbc.HiveDriver
hive.url jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
hive.user hive
Artifact org.apache.hive:hive-jdbc:3.1.2
Depending on the driver version, there were different errors, but this time after typing:
%jdbc(hive)
SELECT * FROM mydb.mytable;
I get the following error:
Could not open client transport for any of the Server URI's in
ZooKeeper: Could not establish connection to
jdbc:hive2://sandbox-hdp.hortonworks.com:10000/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;hive.server2.proxy.user=hive;?tez.application.tags=paragraph_1645270946147_194101954;mapreduce.job.tags=paragraph_1645270946147_194101954;:
Required field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{set:hiveconf:mapreduce.job.tags=paragraph_1645270946147_194101954,
set:hiveconf:hive.server2.thrift.resultset.default.fetch.size=1000,
hive.server2.proxy.user=hive, use:database=default,
set:hiveconf:tez.application.tags=paragraph_1645270946147_194101954})
I will be very grateful to everyone for any help. Regards.
So, after many hours and trials, here's a working solution. First of all, the most important thing is to use drivers that match your version of Hadoop. You need jar files like 'hive-jdbc-standalone' and 'hadoop-common' in their respective versions, and to avoid adding all of them to the 'Artifact' field of the %jdbc interpreter in Zeppelin, it is best to use one complete file containing all the required dependencies.
Thanks to Tim Veil, such a file is available in his GitHub repository below:
https://github.com/timveil/hive-jdbc-uber-jar/
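If it helps, fetching that jar and putting it where the interpreter settings below expect it could look roughly like this (the exact release tag and file name have to be taken from the repository's releases page to match your HDP version; the URL below is a placeholder):
wget -O /opt/zeppelin/interpreter/jdbc/hive-jdbc-uber-2.6.5.0-292.jar \
  https://github.com/timveil/hive-jdbc-uber-jar/releases/download/<release-tag>/hive-jdbc-uber-2.6.5.0-292.jar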
This is my complete Zeppelin %jdbc interpreter settings:
default.url jdbc:postgresql://localhost:5432/
default.user gpadmin
default.password
default.driver org.postgresql.Driver
default.completer.ttlInSeconds 120
default.completer.schemaFilters
default.precode
default.statementPrecode
common.max_count 1000
zeppelin.jdbc.auth.type SIMPLE
zeppelin.jdbc.auth.kerberos.proxy.enable false
zeppelin.jdbc.concurrent.use true
zeppelin.jdbc.concurrent.max_connection 10
zeppelin.jdbc.keytab.location
zeppelin.jdbc.principal
zeppelin.jdbc.interpolation false
zeppelin.jdbc.maxConnLifetime -1
zeppelin.jdbc.maxRows 1000
zeppelin.jdbc.hive.timeout.threshold 60000
zeppelin.jdbc.hive.monitor.query_interval 1000
hive.driver org.apache.hive.jdbc.HiveDriver
hive.password
hive.proxy.user.property hive.server2.proxy.user
hive.splitQueries true
hive.url jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default
hive.user hive
Dependencies
Artifact
/opt/zeppelin/interpreter/jdbc/hive-jdbc-uber-2.6.5.0-292.jar
The next step is to go to Ambari at http://localhost:8080/ and log in as admin. To do that, you must first log in to the Hadoop root account via SSH:
ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password: hadoop
After a successful login you will be prompted to change your password immediately; please do that, and then set the Ambari admin password with the command:
[root@sandbox-hdp ~]# ambari-admin-password-reset
After that you can use the admin account in Ambari (log in and click the Hive link in the left panel):
Ambari -> Hive -> Configs -> Advanced -> Custom hive-site
Click Add Property
Insert the following into the window that opens:
hive.security.authorization.sqlstd.confwhitelist.append=tez.application.tags
After saving, restart all Hive services in Ambari. Everything should be working now, provided you set the proper Java path in 'zeppelin-env.sh' and the port in 'zeppelin-site.xml' (you must copy and rename 'zeppelin-env.sh.template' and 'zeppelin-site.xml.template' in the Zeppelin conf directory; please remember that Ambari also uses port 8080!). A rough sketch of that step follows below.
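For reference, this is roughly what the template step looks like, assuming Zeppelin is installed under /opt/zeppelin and a JDK lives at /usr/lib/jvm/java-1.8.0 (both paths are assumptions; adjust them to your layout):
cd /opt/zeppelin/conf
cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
# in zeppelin-env.sh, point Zeppelin at your JDK:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
# in zeppelin-site.xml, change zeppelin.server.port from 8080 to e.g. 9090,
# since Ambari already listens on 8080.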
In my case, the only thing left to do is add or uncomment the fragment responsible for the Helium plug-in repository (in 'zeppelin-site.xml'):
<property>
<name>zeppelin.helium.registry</name>
<value>helium,https://s3.amazonaws.com/helium-package/helium.json</value>
<description>Enable helium packages</description>
</property>
Now you can go to the Helium tab in the top right corner of the Zeppelin sheet and install the plugins of your choice; in my case it is the 'zeppelin-leaflet' visualization. And voilà! Here is a sample visualization of this Kaggle dataset stored in Hive:
https://www.kaggle.com/kartik2112/fraud-detection
Have a nice day!

"Select Version" is empty after successfully installed and setup Apache Ambari from source

I managed to build, install, and set up Apache Ambari successfully on my Debian 10 machine.
I am able to launch the Ambari admin UI and see the wizard.
However, in the second step of the wizard, Select Version, I see no choice to select:
What am I supposed to do here?
Thanks!
@wxh You need to install a stack repository in order to provide Ambari with components and services to install, for example HDP2, HDP3, or HDF.
Debian 10 repositories are not built, but you can try the Debian 9 ones:
https://docs.cloudera.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-installation/content/hdp_31_repositories.html
If you are installing from source, you may not want to use HDP/HDF at all; that, however, opens up the huge can of worms of making your own stack.
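If you do go the HDF route, the stack is delivered as a management pack that can be registered from the command line before re-running the wizard; a rough sketch (the actual mpack tarball URL has to come from the Cloudera/HDF documentation; the path below is just a placeholder):
ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-<version>.tar.gz --verbose
ambari-server restart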

windows cluster - SSH seems to be failing

Two physical systems, each running Windows Server 2008.
Installed DataStax Community (version 2.0.7 64-bit) on each (that is the version number in the DataStax package I downloaded, according to the file name).
OpsCenter running locally shows a running 1-node cluster. I can execute I/O on the system at the command line (using cassandra-stress).
The system names are "5017-cassandra-1" and "5017-cassandra-2"
I'd like to create a cluster in which both nodes participate. This is not a production environment (I'm just trying to learn).
From OpsCenter on 5017-cassandra-1 I go to Nodes (I see 1 node, of course), then Add Nodes.
I leave the "Package" drop-down at the default (though the latest version shown in the drop-down is 2.0.6), enter the IP address of 5017-cassandra-2, add the Administrator user name and password in the "Node Credentials (sudo)" fields, and press "Add Nodes", and get:
Error provisioning cluster: Unable to SSH to some of the hosts
Unable to SSH to 10.108.14.224:
global name 'get_output' is not defined
Reading that I needed to add OpenSSL, I installed the runtime redistributables (on both systems) and Win64 OpenSSL-1_0_1h.
The error persists.
Any suggestions or a link to a step-by-step guide would be appreciated.

Missing conf directory in Windows 32-bit Neo4j 1.9.4 Community Edition installation for remote connection

I'm trying to follow http://docs.neo4j.org/chunked/1.9.4/server-configuration.html in order to set up the server to accept external connections (org.neo4j.server.webserver.address=0.0.0.0 in conf/neo4j-server.properties, according to those docs).
I downloaded the installer from http://www.neo4j.org/download_thanks?edition=community&release=1.9.4&platform=windows&packaging=exe&architecture=x32. Note that this is an installer, not an archive (rar, bz or whatever).
The "conf" directory does not appear. I have been working around it (passing VM arguments at startup, etc.), but I'm getting nowhere. So far I have found:
- C:\Documents and Settings\[myuser]\Datos de programa\neo4j-community.vmoptions [file]
- C:\Documents and Settings\xp\Datos de programa\Neo4j Community [directory with a logs subdirectory]
- C:\Documents and Settings\[myuser]\Mis documentos\Neo4j\default.graphdb [directory for the default graphdb, I think; it contains a neo4j.properties file. Is this the neo4j-server.properties file the docs refer to?]
- C:\Archivos de programa\Neo4j Community [directory with .install4j, bin, jre (I don't have Java installed on the system) subdirectories]
The bin folder in C:\Archivos de programa\Neo4j Community contains just:
- neo4j-community.exe
- neo4j-community.vmoptions
- neo4j-community-user-vmoptions.loc
- neo4j-desktop-1.9.4.jar
As you can see, there is nothing here like the Neo4j.bat so often mentioned in the docs.
My Neo4j server is running perfectly. I even played for a while with the fun webadmin on localhost:7474. But when I needed to connect from an Ubuntu PC and went to the docs for help, I ran into the mysterious case of the missing conf directory.
Are these docs not related to the Community edition, or at least not to the Windows 1.9.4 installer?
Right now I just need to connect from an external client, not localhost, but tomorrow I might want to fly on a cow, and I suspect that for that I will need the conf directory. Any help, either with the particular case of a remote connection to the Neo4j server or with the general case of the missing conf directory, will be appreciated.
PS. I found "neo4j community V1.9.4 - how to configure IP address and default database location?" just a few seconds before I finished writing this. Where can I download the distribution rather than the installer? The official site doesn't seem to offer any choice other than the 32/64-bit installers.
If you want to download the distribution, you can get it from here:
http://www.neo4j.org/download/other_versions
In the Community version, if you notice, there is an option to download either an "Installer" or a "Binary".
Just download the binary package and extract it; everything (including the conf folder) is inside.
By the way, if you are using the Windows installer, Neo4j provides a user-friendly GUI; if you notice, there is a "Settings..." button for you to configure your server.
To change the server address, for example, you need to modify neo4j-server.properties.
The configuration is the same; just the location of the config file is different.
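For example, to accept remote connections as described in the question, the relevant lines in conf/neo4j-server.properties would look roughly like this (0.0.0.0 binds the web server to all interfaces; 7474 is the default port):
# conf/neo4j-server.properties
org.neo4j.server.webserver.address=0.0.0.0
org.neo4j.server.webserver.port=7474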

Setting up RabbitMQ cluster on Windows servers

I am trying to set up a RabbitMQ cluster on Windows servers, and this requires using a shared Erlang cookie file. According to the documentation, all I need to do is ensure that the root directories on the different machines contain the same .erlang.cookie file. So what I did was find these files on both machines and overwrite them with the same shared version.
After that, all rabbitmqctl commands failed on the machine with the new file version with an "unable to connect to node..." error message. I tried restarting the RabbitMQ Windows service, but rabbitmqctl still complained. I even reinstalled RabbitMQ on that machine, but then .erlang.cookie was reset back to the old version. Whenever I tried to use the new version of the cookie file, rabbitmqctl failed. When I restored the old version, it worked fine.
Basically, I am stuck and cannot proceed with the cluster setup until I resolve this issue. Any help is appreciated.
UPDATE: Received an answer from RabbitMQ:
"rabbitmqctl will pick up the cookie from the user home directory while the service will pick it up from C:\windows. So you will need to synchronise those with each other, as well as with the other machine."
This basically means that the cookie file needs to be replaced in two places: C:\Windows and the current user's home directory.
You have the above correct. The service will use the cookie at C:\Windows, and when you use rabbitmqctl.bat to query the status it uses the cookie in your user directory (%USERPROFILE%).
When the cookies don't match, the error looks like:
C:\Program Files (x86)\RabbitMQ Server\rabbitmq_server-2.8.2\sbin>rabbitmqctl.bat status
Status of node 'rabbit@PC-FOOBAR' ...
Error: unable to connect to node 'rabbit@PC-FOOBAR': nodedown
DIAGNOSTICS
===========
nodes in question: ['rabbit@PC-FOOBAR']
hosts, their running nodes and ports:
- PC-FOOBAR: [{rabbit,49186},{rabbitmqctl30566,63150}]
current node details:
- node name: 'rabbitmqctl30566@pc-foobar'
- home dir: U:\
- cookie hash: Vp52cEvPP1PukagWi5S/fQ==
There is one more gotcha for RabbitMQ cookies on Windows... If you have %HOMEDRIVE% and %HOMEPATH% environment variables set (as we do in our current test environment, which is what sets the home dir above to U:\), then RabbitMQ will look for the cookie there, and if there isn't one it makes one up and writes it there. This left me banging my head on my desk for quite a while when trying to get this working. Once I found this gotcha, it was obvious the cookie files were the problem (as documented); they were just in an odd location (not documented, AFAIK).
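A minimal sketch of keeping the two copies on one machine in sync, assuming the default locations above and taking the service's copy as the one the cluster should share (run from an elevated prompt; note that C:\Windows\.erlang.cookie may be hidden and read-only):
net stop RabbitMQ
copy /Y "C:\Windows\.erlang.cookie" "%USERPROFILE%\.erlang.cookie"
net start RabbitMQ
Repeat with the same shared cookie on the other machine before joining the cluster.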
Hope this solves someone's pain setting up RabbitMQ clustering on Windows.
