Ambari metrics collector error - hortonworks-data-platform

We have a 5-node Hortonworks cluster with Ambari Metrics Monitors installed on all nodes and the Metrics Collector installed on the master node.
I am getting Connection failed: [Errno 111] Connection refused to 0.0.0.0:6188
Please find the error attached:
https://drive.google.com/file/d/0B85rPUe3-QPXbXJSRzJmdUwwQU0/view?usp=sharing
I followed the document below and tried removing the service and adding it back.
https://cwiki.apache.org/confluence/display/AMBARI/Moving+Metrics+Collector+to+a+new+host
First of all, I am not able to find the origin of the error. Please share your experience if you have ever faced this problem.

It sometimes happens that the port is already in use by another process when you try to move the collector to a new host with the curl commands specified on the Apache wiki.
Instead of doing this manually, you can leverage the feature Ambari provides in its GUI to move components from one host to another:
'Move Master Wizard'
Follow the steps stated in the Move Master Wizard; Ambari will take care of the rest for you.

I have fixed this issue by killing the process running on that port and restarting the service. You can also do a manual reboot of the machine to fix this issue.
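For example, something along these lines should locate and clear the conflicting process (assuming the collector is on the default port 6188 and that the ambari-metrics-collector control script is in its usual HDP location; adjust for your setup):
# Find which process is holding the Metrics Collector port (6188 by default)
netstat -tulpn | grep 6188
# Kill the offending process, using the PID reported above
kill -9 <pid>
# Restart the collector from the Ambari UI, or directly on the collector host
/usr/sbin/ambari-metrics-collector restart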

Related

Can someone elaborate on the necessary proxy settings in the install-config.yaml file for an OKD installation in an air-gapped environment?

I am attempting an installation of OKD 4.5 in a restricted (i.e. air-gapped) environment. I am running into an issue during the installation process wherein, as far as I can tell, the bootstrap machine is attempting and failing to access the mirrored registry I have running.
Based on my research, I believe this issue stems from a lack of proxy settings within the install-config.yaml file as described in the documentation here. However, I am having trouble wrapping my brain around what functions I'm attempting to accommodate by adding this proxy information to the configuration, and exactly what information I should be adding. I haven't been able to find any other segments of the documentation that go into detail about this either (though if someone can simply point me in the direction of such documentation, that would be extremely helpful).
Would anyone be willing to explain to me what values should go into the proxy lines in this file and why? Does this information replace, complement, or require changes in any way to the networking segment of the configuration?
As a related question, do I need to change any of the networking subnet values to reflect my local network? In all the examples I've seen, the clusterNetwork.cidr and serviceNetwork subnets are the same as in the documentation (cidr: 10.128.0.0/14, serviceNetwork: - 172.30.0.0/16), and some include an additional machineNetwork field. Is this field something I should be adding, and if so, should I just be entering my local subnet?
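For reference, the stanzas I'm asking about look roughly like this (documented defaults plus placeholders, not values from my environment):
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: <my local subnet?>
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
proxy:
  httpProxy: http://<proxy-host>:<port>
  httpsProxy: http://<proxy-host>:<port>
  noProxy: <domains and CIDRs that should bypass the proxy>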
As context for my specific scenario, here are my environment specifications as well as the specific errors I am getting:
OKD Release: `4.5.0-0.okd-2020-10-15-235428`
Environment: Virtualized Bootstrap, Master, and Worker nodes in virt-manager, running on CentOS 7 in an air-gapped environment. The host machine contains the install directory and also provides DNS, an Apache server, HAProxy for load balancing, and the mirrored registry.
Errors:
From <log-bundle>/bootstrap/journals/release-image.log:
localhost.localdomain release-image-download.sh[114151]: Error: Error initializing source docker://okd-services.okd.local:5000/okd@sha256<.....>:
error pinging docker registry okd-services.okd.local:5000: Get "https://okd-services.okd.local:5000/v2/":
dial tcp <okd-services.okd.local ip>:5000: connect: connection refused
From systemctl status named (several requests to IPs I don't recognize which seem to be NTP requests):
network unreachable resolving '2.fedora.pool.ntp.org.okd/AAAA..
network unreachable resolving './NS/IN': 199.7.91.13#53
etc
I have ensured that host-to-node and node-to-node communication is present, and that the registry is accessible from the nodes (to test, I netcat the certificate pem into a node and update its trusts, then curl -u the registry using https://fqdn:5000/v2/_catalog), so I am fairly certain all the connections are established properly.
To conclude, since I'm fairly sure that the proxy/network settings in the install-config.yaml file are to blame, and since I am unable to find more elaboration on these configurations in the official docs or elsewhere, I would very much appreciate any in-depth explanation of how I should be configuring this for an air-gapped environment. Additionally, if anyone believes that another issue is the cause, any input regarding that would be great.

Unable to remove a dead host from Ambari

We had some problems with a host and had to shut it down.
Now we are not able to remove that dead host from Ambari.
Whenever we go to Hosts -> click on the dead host -> Host Actions -> Delete Host, we get:
This host cannot be deleted since it has the following master components: DRPC Server, Falcon Server etc.
If I go to those services, all the actions for each service are greyed out, so there is no way I can move them to another host because the actions are disabled.
Please suggest a way ahead. Is handling sudden death of a service not possible in Ambari?
You can try the Ambari API as explained here. Some features of the Ambari API aren't implemented in the user interface right now.
I remember a case at my company where we couldn't remove a node with the Ambari UI. With an API call like the one explained in this link, it was possible.
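As a rough sketch of what such a call looks like (AMBARI_HOST, CLUSTER, DEAD_HOST, COMPONENT and the admin credentials are placeholders; delete any leftover host components first, then the host itself):
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE "http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER/hosts/DEAD_HOST/host_components/COMPONENT"
curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE "http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER/hosts/DEAD_HOST"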

Connection issue when using HBase in Ambari

Recently I have been working with Ambari. The installation succeeded and everything is working well except HBase: only the HBase Master is healthy, and all the RegionServers raise the alert:
Connection failed: [Errno 111] Connection refused to server1.hadoop:16030. (The hostname differs per machine.)
Anyone have the same problem?
I have fixed this issue. I read the log files under /var/log/hbase/*.log on my RegionServers and found that their clocks were not in sync with the master's. So I made all the servers sync their clocks to the master's using ntpd. Then I restarted the Ambari components and no alert showed up.
I think this may help those with the same problem.
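If it helps, the commands involved look something like this (master1.hadoop is a placeholder for your time source; on systems using chrony the commands differ):
# Check the current offset between this RegionServer and its NTP peers
ntpq -p
# One-off sync against the chosen time source, then keep ntpd running
sudo systemctl stop ntpd
sudo ntpdate master1.hadoop
sudo systemctl start ntpd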

Apache Spark error: Could not connect to akka.tcp://sparkMaster@

These are our first steps with big data tools like Apache Spark and Hadoop.
We have installed Cloudera CDH 5.3. From Cloudera Manager we chose to install Spark. Spark is up and running on one of the nodes in the cluster.
From my machine I wrote a little application that connects and reads a text file stored on HDFS.
I am trying to run the application from Eclipse and it displays these messages:
15/02/11 14:44:01 INFO client.AppClient$ClientActor: Connecting to master spark://10.62.82.21:7077...
15/02/11 14:44:02 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@10.62.82.21:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@10.62.82.21:7077
15/02/11 14:44:02 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@10.62.82.21:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: /10.62.82.21:7077
The application has one class that creates a context using the following line:
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count").setMaster("spark://10.62.82.21:7077"));
where this IP is the IP of the machine Spark is running on.
Then I try to read a file from HDFS using the following line:
sc.textFile("hdfs://10.62.82.21/tmp/words.txt")
When I run the application I get the connection errors shown above.
Check your Spark master logs; you should see something like:
15/02/11 13:37:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@mymaster:7077]
15/02/11 13:37:14 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkMaster@mymaster:7077]
15/02/11 13:37:14 INFO Master: Starting Spark master at spark://mymaster:7077
Then, when you're connecting to the master, be sure to use exactly the same hostname as found in the logs above (do not use the IP address):
.setMaster("spark://mymaster:7077"));
Spark standalone is a bit picky about this hostname/IP stuff.
When you create your Spark master using the shell command "sbin/start-master.sh", go to the address http://localhost:8080 and check the "URL" row.
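For example, assuming a default standalone setup, you can confirm the exact master URL like this (the log file name pattern may vary by version):
# Start the standalone master, then read the spark:// URL it advertises
./sbin/start-master.sh
grep "Starting Spark master" logs/spark-*-org.apache.spark.deploy.master.Master-*.out
# The same URL appears in the "URL" row of the web UI at http://localhost:8080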
I notice there is no accepted answer, so just for info I thought I'd mention a couple of things.
First, in the spark-env.sh file in the conf directory, the SPARK_MASTER_IP and SPARK_LOCAL_IP settings can be hostnames. You don't want them to be, but they can be.
As noted in another answer, Spark can be a little picky about hostname vs. IP address because of this resolved bug/feature: see the bug here. The problem is, it's not clear whether the "resolution" is simply to tell us to use the IP instead of the hostname.
Well, I am having this same problem right now, and the first thing to do is check the basics.
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you ssh without a password to the worker from the master box? Per the 1.5.2 docs, you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
Finally, of all the combinations of settings, in a limited experiment just now I found only one that mattered: on the master, in spark-env.sh, set SPARK_MASTER_IP to the IP address, not the hostname. Then connect from the worker with spark://192.168.0.10:7077 and voila, it connects! Seemingly none of the other config parameters are needed here.
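A minimal sketch of that setup (the IP address is just the example from above; adjust for your network):
# On the master, in conf/spark-env.sh:
export SPARK_MASTER_IP=192.168.0.10
# Restart the master, then start the worker against that exact URL:
./sbin/start-slave.sh spark://192.168.0.10:7077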
Here's the paragraph from the docs about ssh and the slaves file in conf:
To launch a Spark standalone cluster with the launch scripts, you
should create a file called conf/slaves in your Spark directory, which
must contain the hostnames of all the machines where you intend to
start Spark workers, one per line. If conf/slaves does not exist, the
launch scripts defaults to a single machine (localhost), which is
useful for testing. Note, the master machine accesses each of the
worker machines via ssh. By default, ssh is run in parallel and
requires password-less (using a private key) access to be setup. If
you do not have a password-less setup, you can set the environment
variable SPARK_SSH_FOREGROUND and serially provide a password for each
worker.
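For completeness, a rough sketch of that setup (worker hostnames are placeholders):
# On the master: list the worker hostnames, one per line
printf "worker1\nworker2\n" >> conf/slaves
# Create a key if you don't already have one, then copy it to each worker for password-less ssh
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id worker1
ssh-copy-id worker2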
Once you have done that, using the IP address should work in your code. Let us know! This can be an annoying problem, and learning that most of the config params don't matter was nice.

Use spark-submit to submit an application to an EC2 cluster

I am new to Spark and I am trying to run it on EC2. I followed the tutorial on the Spark web page, using spark-ec2 to launch a Spark cluster. Then I tried to use spark-submit to submit the application to the cluster. The command looks like this:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://ec2-54-88-9-74.compute-1.amazonaws.com:7077 --executor-memory 2G --total-executor-cores 1 ./examples/target/scala-2.10/spark-examples_2.10-1.0.0.jar 100
However, I got the following error:
ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
Please let me know how to fix it. Thanks.
You're seeing this issue because the master node of your spark-standalone cluster can't open a TCP connection back to the driver (on your machine). The default mode of spark-submit is client, which runs the driver on the machine that submitted it.
A new cluster mode was added to spark-deploy that submits the job to the master, where it is then run on a client, removing the need for a direct connection. Unfortunately this mode is not supported in standalone mode.
You can vote for the JIRA issue here: https://issues.apache.org/jira/browse/SPARK-2260
Tunneling your connection via SSH is possible but latency would be a big issue since the driver would be running locally on your machine.
I'm curious if you are still having this issue ... but in case anyone else is asking, here is a brief answer. As clarified by jhappoldt, the master node of your spark-standalone cluster can't open a TCP connection back to the driver (on your local machine). Two workarounds are possible; I tested both and they succeeded.
(1) From the EC2 Management Console, create a new security group and add rules to enable TCP back and forth from your PC (public IP). (What I did was add TCP rules inbound and outbound.) Then add this security group to your master instance (right click --> Networking --> Change security groups). Note: add it, and don't remove the already established security groups.
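If you prefer the AWS CLI, the equivalent rule looks roughly like this (the security group ID and your public IP are placeholders):
# Allow your PC to reach the standalone master port; repeat for any other ports/directions you need
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 7077 --cidr 203.0.113.10/32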
This solution works well, but in your specific scenario (deploying your application from a local machine to an EC2 cluster) you will face further, resource-related problems, so the next option is the better one.
(2) Copy your .jar file (or .egg) to the master node using scp. You can check this link http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html for information about how to do that, and deploy your application from the master node. Note: Spark is already pre-installed, so you do nothing but write the exact same command you would write on your local machine from ~/spark/bin. This should work perfectly.
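Roughly, reusing the hostname and jar from your command (the key path is a placeholder, and root is the default user on spark-ec2 AMIs; adjust as needed):
# Copy the application jar to the EC2 master node
scp -i /path/to/key.pem ./examples/target/scala-2.10/spark-examples_2.10-1.0.0.jar root@ec2-54-88-9-74.compute-1.amazonaws.com:~/
# Log in to the master and submit from there
ssh -i /path/to/key.pem root@ec2-54-88-9-74.compute-1.amazonaws.com
~/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://ec2-54-88-9-74.compute-1.amazonaws.com:7077 --executor-memory 2G --total-executor-cores 1 ~/spark-examples_2.10-1.0.0.jar 100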
Are you executing the command on your local machine or on the created EC2 node? If you're doing it locally, make sure port 7077 is open in the security settings, as it is closed to the outside by default.
