Error in Cloudera Cluster installation process? - hadoop

I have installed Cloudera manager successfully. It shows Currently managed hosts as 127.0.0.1 and it is active.
When I search and install cluster using the cloudera manager after the loads it shows the following error.
Installation failed. Failed to receive heartbeat from agent.
Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager server (check firewall rules).
Ensure that ports 9000 and 9001 are free on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).
The following image clearly shows the problem while installing my cluster on cloudera manager.

I had a similar problem and it turned out the issue was conveniently skipping (unfortunately) the ...password-less SSH key ... step
After several hours breaking my head over it, I realised this.
At the terminal do,
ls -al ~/.ssh
You must see files like,
abc
abc.pub
These are you public/private key pairs. [Not necessarily the same names as mine above].The file name you used in Setting up SSH public/private keys step for your machine.
You need to copy the data in abc.pub to a file authorized_keys in this same folder. If its not there, create authorized_keys.
Incase you don't have you public/private key pair see here

For ubuntu, the problem is usually because of the association of "ubuntu 127.0.1.1." in your /etc/hosts file. For me, after changing it to "ubuntu 127.0.0.1", which is the standard local loopback, I can add the cluster successfully. Hope this helps!

I was struggling with this problem for two days. Fixing /etc/hosts as suggested by "khoadoan" worked for me.
/etc/hosts was looking like this when I had the problem
127.0.0.1 localhost
127.0.1.1 ubuntu
I changed it like this:
127.0.0.1 localhost
127.0.0.1 ubuntu
Restarted the machine.
sudo init 6
Launched the Cloudera Manager Admin page. This time the host status was already showing up "Managed = Yes". And I got an additional tab "Currently Managed Hosts(1)", where the local host was listed.

Related

RabbitMQ Erlang distribution failed

I have two Windows Server 2012 R2 machines located in one of the client's datacenters. Both servers are domain-joined. They both have RabbitMQ 3.6.0. installed on them. RabbitMQ is running as Windows Service on both machines. I've tried to cluster these two machines for a long time now without success. I always get the following error when I try to cluster them.
One the first machine nodeA I run the command 'rabbitmqctl join_cluster rabbit#nodeB'. This is what I get:
Clustering node 'rabbit#nodeA' with 'rabbit#nodeB' ...
Error: unable to connect to nodes ['rabbit#nodeB']: nodedown
`DIAGNOSTICS`
===========
attempted to contact: ['rabbit#nodeB']
rabbit#nodeB:
* connected to epmd (port 4369) on nodeB
* epmd reports node 'rabbit' running on port 25672
* TCP connection succeeded but Erlang distribution failed
* suggestion: hostname mismatch?
* suggestion: is the cookie set correctly?
* suggestion: is the Erlang distribution using TLS?
current node details:
- node name: 'rabbitmq-cli-3892#nodeA'
- home dir: C:\Users\mydirectory
- cookie hash: l+SSu57+cRyAQ03AJdwAbQ==
I've tried this setup with Azure Virtual Machines within Azure Virtual Network and I succeeded to cluster the two VM's, however it seems I cannot connect these two (customer's machines) together.
This is what I have done and ensured:
There isn't any firewall blocking connections
Added host names to hosts file located on C:\Windows\system32\drivers\etc
Tried to refer to host names as FQDN without adding anything to hosts file
Tried to refer to host names with CAPITAL letters and without
Copied the same exact .erlang.cookie to C:\Windows and C:\Users\mydirectory on both machines.
I've read, understood and applied RabbitMQ Clustering Guide https://www.rabbitmq.com/clustering.html
Stopped, restarted, reinstalled RabbitMQ on both machines.
It seems I can't get it to work. On Azure machines, which were not domain-joined clustering worked beautifully. I am really running out of options... Any help?
i had the same problem you need to install rabbitmq as a admin. uninstall then reinstall as admin and it should work fine
Try to connect to each of RabbitMQ nodes via remote shell and check if value of cookie is the same (cookie can be set in 3 different ways: .erlang.cookie is one of them).
erl -remsh 'rabbitmq-cli-3892#nodeA' -name 'test#nodeA'
erlang:get_cookie().

cloudera host with bad health during install

Trying again & again with all required steps completed but cluster Installation when install selected Parcels, always shows every host with bad health. setup never completed at full.
i am installing cm 5.5 on CentOS 6.7 using virtualbox.
The Error
Host is in bad health cm.feuni.edu
Host is in bad health dn1.feuni.edu
Host is in bad health dn2.feuni.edu
Host is in bad health nn1.feuni.edu
Host is in bad health nn2.feuni.edu
Host is in bad health rm.feuni.edu
above error are shown on step 6 where setup says
The selected parcels are being downloaded and installed on all the hosts in the cluster
in previous step 5 all hosts were completed with heartbeat checks in the end
memory distributions
cm 8GB
all others with 1GB
i could not find proper answer anywhere else. What reason could be for the bad health?
I don't know if it will help you...
For me, after a few days I struggled with it,
I found the log files (at )
It had a comment there is a mismatch of the guid,
so I uninstalled everything from both machines (using the script they give,/usr/share/cmf/uninstall-cloudera-manager.sh , yum remove 'cloudera-manager-*' and deletion of every directory related to cloudera I found...)
and then removed the guid file:
rm /var/lib/cloudera-scm-agent/cm_guid
Afterwards I re-installed everything, and that fixed that issue for me...
I read online that there can be issues with the hostname and things like that, but I guess that if you get to this part of the installation, you already fixed all the domain/FDQN/hosname/hosts issues.
It saddens me there is no real manual/FAQ for this product.. :(
Good luck!
I faced the same problem. This is my solution:
First I edited config.ini
$ nano /etc/cloudera-scm-agent/config.ini
so that the hostname where the same as the command $ hostname returned.
then I restarted the agent and the server of cloudera:
$ service cloudera-scm-agent restart
$ service cloudera-scm-server restart
then in cloudera manager I deleted the cluster and added again. The wizard continued to run normally.

Unable to get Mesos to run from tutorial: Setting up a Single Node Mesosphere Cluster

I have been following this tutorial to try and setup a single node mesosphere cluster from their
official tutorial:
http://mesosphere.com/docs/getting-started/developer/single-node-install/
I followed all the commands without any issues, and I also added the ports 5050 and 8080 to my security group. When I try to access the console for mesos/marathon, I get a "Internet Explorer cannot display the webpage" message.
They also recommend checking it the following way:
MASTER=$(mesos-resolve `cat /etc/mesos/zk`)
mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"
But that comes up with an error:
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0106 17:03:08.126703 20993 process.cpp:1561] Failed to initialize, gethostbyname2: Unknown host
*** Check failure stack trace: ***
I am not really sure how to troubleshoot this either, and there are not many tutorials I could find on how to install mesos on ubuntu.
I checked the contents of the zk file, seems to be the default value.
$ cat /etc/mesos/zk
zk://localhost:2181/mesos
I would really appreciate any clues on how to go about this one.
Edit: The process is definitely running too - just an fyi:
root 31545 8.5 5.9 187464 35604 ? Ssl 17:28 0:00 /usr/local/sbin/mesos-slave --master=zk://localhost:2181/mesos --log_dir=/var/log/mesos
root 31563 28.5 2.1 116304 12856 ? Rs 17:28 0:00 /usr/local/sbin/mesos-master --zk=zk://localhost:2181/mesos --port=5050 --log_dir=/var/log/mesos --quorum=1 --wo
Mesos uses gethostbyname2 to resolve hostnames to IPs. The first thing I would recommend, is to try "ping localhost" and "ping hostname", and verify that there are no strange settings in /etc/hosts. If you're doing a multi-node cluster, I'd recommend that hostname map to the public IP address (not 127.0.x.1).
If that doesn't help, you can try setting the --ip and --hostname flags when starting mesos-master and mesos-slave, to bypass the gethostbyname2 resolution. These can also be set by writing to the file-based parameters, e.g. /etc/mesos/mesos-master/ip
For additional troubleshooting, try running wget http://localhost:5050 (or curl -L) from the mesos master, to verify that it is locally visible. Also try wget http://<public_ip>:5050 to verify that the web server is up and serving to the public IP. Depending on how your (EC2?) node is setup, you may need to expose/forward the port, or connect to a VPN.
Thanks Adam. I ran the wget and curl commands, and nothing was actually listening on port 8080 or 5050. I did open those ports in the ec2. A simple reboot did the trick however, once I ssh'ed into the ec2 instance after the reboot, both mesos and marathon were running and both ports are now showing after I ran
netstat -ntln.

Hadoop Ambari cannot confirm hosts

I tried to use Ambari to manage the installation and maintenance of the Hadoop cluster.
After I started ambari server, I use the web page to set up Hadoop cluster.
But at the 3rd step-- confirm hosts, the error shows below
And I check the log at /var/log/ambari-server, I found:
INFO:root:BootStrapping hosts ['qiao'] using /usr/lib/python2.6/site-packages/ambari_server cluster primary OS: redhat6 with user 'root' sshKey File /var/run/ambari-server/bootstrap/1/sshKey password File null using tmp dir /var/run/ambari-server/bootstrap/1 ambari: master; server_port: 8080; ambari version: 1.4.1.25
INFO:root:Executing parallel bootstrap
ERROR:root:ERROR: Bootstrap of host qiao fails because previous action finished with non-zero exit code (1)
INFO:root:Finished parallel bootstrap
Do you provide ssh rsa private key or paste it?
and from the place you are installing, make sure you can ssh to any hosts without typing any password.
If still the same error, try
ambari-server reset
ambari-server setup
Pls restart ambari-server
ambari-server restart
and then try accessing Ambari
It would work.
Make sure you can ssh to every single host on the list, including all master hosts.
To do this, ensure that Ambari host's .ssh/id_rsa.pub entry is included in every hosts' .ssh/authorized_keys file. Then ssh from Ambari's host to every single server - and check if it is asking for your password. You can use a tutorial like http://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/ to check if everything has been done properly.
You need to do the same on the Ambari host itself, if you added it to hosts list.

Install Chef-Server 11 on EC2 Instance

I am using hosted Chef for quite some time. Wanted to explore the opensource chef server. hence I am trying to setup my Chef-Server 11 on EC2 instance.
I have Chef-server running and I can access the web GUI for the same. I have the chef-workstation configured on another ec2 instance that is also working fine.
Problem: I am not able to upload any cookbook.
I get below error when I try uploading the cookbook:
# knife cookbook upload getting-started
Uploading getting-started [0.4.0]
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:763:in `initialize': Connection refused - connect(2) (Errno::ECONNREFUSED)
However, other list commands of knife are working fine.
I did my home work and bumped on below links:
http://www.opscode.com/blog/2013/03/11/chef-11-server-up-and-running/
http://www.curvve.com/blog/servers/2013/script-to-configure-and-set-your-hostname-and-fqdn-on-ec2-instances/
So,
It is mentioned that the chef-server needs a working FQDN to work. I set the my public ec2 host name as the hostname of the server as well as set it up in /etc/hosts. Rebooted the instance. Ran chef-server-ctl reconfigure again. And still facing the same error.
QUESTION: How to figure out the FQDN part of the EC2 instance for chef-server to work? if anyone has set up chef-server successfully on EC2 and was able to upload the cookbooks, then please share your steps for FQDN workout.
I was having a hard time with this but this solution worked!
Edit /etc/chef-server/chef-server.rb and add these lines (create the file if it doesn't exist):
server_name = "THE PUBLIC IP OF YOUR INSTANCE"
api_fqdn server_name
nginx['url'] = "https://#{server_name}"
nginx['server_name'] = server_name
lb['fqdn'] = server_name
bookshelf['vip'] = server_name
I found the solution here
http://sahebjade.blogspot.com/2013/05/check-your-knife-configuration-and.html
This is how i got it working. updated the public DNS name of my ec2 instance (chef-server) in /etc/sysconfig/network and service network restart. Now I am able to upload the cookbooks fine.
Need to think about elastic IP as potential option for my chef-server.
Edit /etc/chef-server/chef-server.rb and add these lines (create the file if it doesn't exist):
bookshelf["vip"] = node["ipaddress"]
bookshelf["url"] = "https://#{node["ipaddress"]}"
erchef['s3_url_ttl'] = 3600
The first two lines will point your chef-server URL to the machine's IP and the third will solve a timeout issue that apparently always exist when the Chef Server is on EC2.
I wanted to expand some on the answers since they don't give a complete picture. This applies to Chef 11 (hopefully Chef 12 is smarter)
In my case I rolled a master up under VPC #1 which gave it an internal address like this
ip-10-0-0-10.ec2.internal
Because I was only playing with the VPC initially, I had misconfigured some things I needed so I had to drop it and I created a new scheme. Thankfully, I was able to snapshot the old Chef master and bring it up under the new VPC but I found that I couldn't log into Chef anymore. It took some digging but I found in my /var/log/chef-server/chef-server-webui/current log that the install had glommed onto the old hostname and set that as the internal URL for... everything. This caused problems after the internal hostname change
2014-12-24_16:19:09.46680 SocketError: Error connecting to https://ip-10-0-0-10.ec2.internal/users/admin - getaddrinfo: Name or service not known
Now, to the OP answer
Need to think about elastic IP as potential option for my chef-server
In my case, I just added a CNAME to CloudFlare and set that as my permanent address. Since I can set CloudFlare to a low TTL on that one address it makes it easy to move it around between IP changes (I don't need an Elastic IP while I'm just getting it configured). This way I could then tell Chef to always look for the same URL and not worry about an EIP.
Once that was done, I had to update Chef. I don't know what changed (this is 11.16.4) but I found the configs live in /var/opt/chef-server/chef-server-webui/etc/chefserver.rb as opposed to some of the other answers listing chef-server.rb. Not sure if that's a YMMV thing or not.
I changed the following towards the bottom of that file
# Environment specific application configuration.
# These values override the ones set in 'RAILS_ROOT/config/application.rb'
#config.chef_server_url = "https://ip-10-0-0-10.ec2.internal"
config.chef_server_url = "https://chef.mydomain.com"
I also changed /var/opt/chef-server/nginx/etc/chef_https_lb.conf
server_name chef.mydomain.com;
Finally I restarted Chef
chef-server-ctl restart
That seems to have done the trick. Logins work again.

Resources