Kafka: client has run out of available brokers to talk to - elasticsearch

I'm trying to wrap up changes to our Kafka setup, but I'm in over my head and am having a hard time debugging the issue.
I have multiple servers funneling their Ruby on Rails logs to 1 Kafka broker using Filebeat; from there the logs go to our Logstash server and are then stashed in Elasticsearch. I didn't set up the original system, but I tried taking us down from 3 Kafka servers to 1 as they weren't needed. I updated the IP address configs in these files in our setup to remove the 2 old Kafka servers and restarted the appropriate services.
# main (filebeat)
sudo vi /etc/filebeat/filebeat.yml
sudo service filebeat restart
# kafka
sudo vi /etc/hosts
sudo vi /etc/kafka/config/server.properties
sudo vi /etc/zookeeper/conf/zoo.cfg
sudo vi /etc/filebeat/filebeat.yml
sudo service kafka-server restart
sudo service zookeeper-server restart
sudo service filebeat restart
# elasticsearch
sudo service elasticsearch restart
# logstash
sudo vi /etc/logstash/conf.d/00-input-kafka.conf
sudo service logstash restart
sudo service kibana restart
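For reference, this is roughly what the broker entries look like after my edits (a sketch, not our exact files; the IP is the one Filebeat is now pointed at, and the Logstash option name depends on the Kafka input plugin version):
# /etc/filebeat/filebeat.yml (Kafka output section)
output.kafka:
  hosts: ["172.16.137.132:9092"]   # the remaining broker
# /etc/logstash/conf.d/00-input-kafka.conf (Kafka input)
input {
  kafka {
    bootstrap_servers => "172.16.137.132:9092"
  }
}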
When I tail the Filebeat logs I see this -
2018-04-23T15:20:05Z WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *net.OpError=dial tcp 172.16.137.132:9092: getsockopt: connection refused)
2018-04-23T15:20:05Z WARN kafka message: client/metadata no available broker to send metadata request to
2018-04-23T15:20:05Z WARN client/brokers resurrecting 1 dead seed brokers
2018-04-23T15:20:05Z WARN kafka message: Closing Client
2018-04-23T15:20:05Z ERR Kafka connect fails with: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)

to 1 Kafka broker... I tried taking us down from 3 Kafka servers to 1 as they weren't needed. I updated the IP address configs in these files in our setup to remove the 2 old Kafka servers and restarted the appropriate services
I think you are misunderstanding something: Kafka is only a highly available system if you have more than one broker, so the other 2 are needed even though you may only be providing a single broker in the Logstash config.
Your errors state that the single broker refused the connection, and therefore no logs will be sent to it.
At a minimum, I would recommend 4 brokers and a replication factor of 3 on all your critical topics for a useful Kafka cluster. That way you can tolerate broker outages as well as distribute the load across your Kafka brokers.
It would also be beneficial to make the topic count a factor of your total logging servers, as well as key each Kafka message based on the application type, for example. That way you are guaranteed log order for those applications.
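In the meantime, as a quick check of the immediate error (a sketch; adjust paths to match your install), confirm the remaining broker is actually listening on 9092 and advertising an address Filebeat can reach:
# on the Kafka host
grep -E '^(listeners|advertised\.listeners)' /etc/kafka/config/server.properties
sudo netstat -nltp | grep 9092
# from one of the Filebeat hosts
nc -vz 172.16.137.132 9092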

Related

Spring application unable to access kafka running in kubernetes minikube

I used bitnami/kafka to deploy kafka on minikube. A describe of the pod kafka-0 says that the server address is:
KAFKA_CFG_ADVERTISED_LISTENERS:INTERNAL://$(MY_POD_NAME).kafka-headless.default.svc.cluster.local:9093,CLIENT://$(MY_POD_NAME).kafka-headless.default.svc.cluster.local:9092
My kafka address is set like so in Spring config properties:
spring.kafka.bootstrap-servers=["kafka-0.kafka-headless.default.svc.cluster.local:9092"]
But when I try to send a message I get the following error:
Failed to construct kafka producer] with root cause:
org.apache.kafka.common.config.ConfigException:
Invalid url in bootstrap.servers: ["kafka-0.kafka-headless.default.svc.cluster.local:9092"]
Note that this works when I run kafka locally and set the bootstrap-servers address to localhost:9092
How do I fix this error? What is the correct kafka URL to use and where do I find it? thanks
The Minikube network is different from the host network; you need a bridge.
The advertised listener is in the minikube realm, not resolvable from the host.
You could set up a service and an ingress in minikube pointing to your Kafka, then map the advertised hostname to the IP address of the ingress in your hosts file.
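For example (values are illustrative; use whatever IP your ingress or minikube ip gives you, and the hostname from the advertised listener), the hosts entry on the machine running the Spring app would look something like:
# /etc/hosts on the host machine
192.168.49.2   kafka-0.kafka-headless.default.svc.cluster.local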
spring.kafka.bootstrap-servers needs valid server hostnames along with the port number, comma-separated:
hostname-1:port,hostname-2:port
["kafka-0.kafka-headless.default.svc.cluster.local:9092"] does not look like one!

How to set metricbeat on amazon elasticsearch

I have two servers, one for production and one for test, and I've been trying to install Metricbeat on both.
I did it on my test server and set it to send logs to my Amazon Elasticsearch service, and now I can see all the data from that server in Kibana. I did the same process on my production server, and when I use the command sudo metricbeat -e it starts, but I get
ERROR pipeline/output.go:74 Failed to connect: Get https://amazon-endpoint request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
even though the config for both is the same.
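One way to narrow this down is to test the output path from each server directly; the endpoint below is the redacted one from the error above, and metricbeat test output is available in recent Beats versions:
curl -v --max-time 10 https://amazon-endpoint
sudo metricbeat test output
If the curl also times out only on the production server, the issue is network or security-group reachability rather than the Metricbeat config.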

Does fluentd depend on rsyslog?

Still wrapping my head around logging technology. I'm following the fluentd to graylog2 recipe but I don't understand this step:
Open /etc/rsyslog.conf and add the following line to the beginning of the file: *.* @127.0.0.1:5140 Then, restart rsyslogd by running sudo /etc/init.d/rsyslog restart.
What's supposed to listen on 127.0.0.1:5140? Is rsyslog a fluentd dependency?
According to Parse Syslog Messages Robustly:
The problem with syslog is that services have a wide range of log format, and no single parser can parse all syslog messages effectively.
Rsyslog seems to be the recommended way to forward logs to Fluentd.
Fluentd listens on port 5140 if you enable the syslog input. Changing that line in /etc/rsyslog.conf forwards the traffic from Rsyslog to Fluentd.
However, if you don't want to turn on Rsyslog, you can just send the traffic straight to port 5140.
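Note that Fluentd only listens on 5140 if a syslog source is configured; a minimal sketch of that source (the tag is illustrative) would be:
<source>
  @type syslog
  port 5140
  bind 0.0.0.0
  tag graylog2
</source>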

Cannot connect to Cloudera Manager, not listening on port 7180

I'd really appreciate some help to get cloudera manager running on AWS EC2.
It's my first install, and I'm aiming to use the AWS Free Tier to spin up a few nodes and do some training on a Hadoop cluster and the Cloudera distribution. I'm using the RedHat RHEL 7.2 image on AWS EC2.
I am following the instructions here... Cloudera Manager installation
I have installed cloudera manager OK, and get to the screen where it invites you to use a browser to log in to the cloudera manager server. But that's where the problem starts. It seems the app is not listening on port 7180, so there's no hope of connecting from another machine across the network. I can't even connect locally, on the server, yet the service appears to be running OK. But it's not listening on port 7180.
Q1 - How can I confirm the config is set to use port 7180?
Q2 - are there obvious steps that I'm missing here ?
Thanks in advance,
[Edit..]
I'm beginning to wonder if the Free EC2 host is running short on memory to run cloudera manager. I saw one comment that implied that (AWS Forum post). But the process doesn't crash or report any problems in its logfile. So it must be OK, right?
[Edit.... with more diagnostic info....]
Here's a list of the diagnostics I've checked:-
SELinux is not running [for install and testing purposes],
WAN firewalls,
EC2 firewall/Security group,
Local firewall on server,
Cloudera manager log,
Is the service up and running?
Can you connect locally?
Security group on the EC2 instance, it contains:-
SSH and Port 7180,
Firewall/iptables/firewalld on the RedHat instance, tried:-
adding ports to iptables, then
disabling iptables, then
adding ports to firewalld, then
disabling the firewalld service,
$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7180
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7182
But I'm getting the feeling that the installation of cloudera manager is not happy, or not running correctly.
I've checked the cloudera manager log, and it ends with the following.
$ tail /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-02-25 11:02:23,581 INFO main:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 19264 new metrics
2016-02-25 11:02:28,920 INFO main:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 0 updated metrics
2016-02-25 11:02:28,924 INFO main:com.cloudera.cmon.components.MetricSchemaManager: Cross entity aggregates processed.
And when I use tail -f and restart the cloudera-scm-server service, the log scrolls a lot and comes back to the same state. If I search for ERROR, there are no lines with "ERR".
$ sudo service cloudera-scm-server start
Starting cloudera-scm-server (via systemctl): [ OK ]
$ sudo systemctl status cloudera-scm-server
● cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server)
Active: active (exited) since Thu 2016-02-25 12:23:03 EST; 44s ago
Docs: man:systemd-sysv-generator(8)
Process: 747 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
So, if I try to test the service by connecting from the local machine, I get the sort of behaviour that makes me think it's just not listening, and maybe not started correctly.
Try poking it with curl from the same shell where the cloudera-scm-server service was started:
$ curl localhost:7180
curl: (7) Failed connect to localhost:7180; Connection refused
$ wget localhost:7180
--2016-02-25 08:00:16-- http://localhost:7180/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:7180... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:7180... failed: Connection refused.
Try checking what ports are listening on that machine - no 7180, what's up with that?
$ netstat -nltp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7432 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN -
tcp6 0 0 :::7432 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 ::1:25 :::* LISTEN -
Here's what to look for, and a possible solution - give it more memory...
Check the status of the cloudera-scm-server service using [depending on your flavour of linux]
$ sudo service cloudera-scm-server status
OR
$ sudo systemctl status cloudera-scm-server
Look for the status - Active: active (running)
But if you find - Active: active (exited)
you may have a problem during the startup of the cloudera-scm-server.
In which case, look at the log files for cloudera-scm-server
$ sudo ls -l /var/log/cloudera-scm-server
$ sudo cat /var/log/cloudera-scm-server/cloudera-scm-server.out
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000078dc58000, 265809920, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 265809920 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid831.log
[ec2-user@ip-172-31-31-166 ~]$ sudo tail -100 /var/log/cloudera-scm-server/cloudera-scm-server.out
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000078dc58000, 265809920, 0) failed; error='Cannot allocate memory' (errno=12)
Use the command top to indicate how much memory is available to your system.
Possible solution - have a look at this discussion at Cloudera forum
In this case the java heap size was too small.
As we see that heap was exhausted, assuming this is not a memory leak
or something of the sort, Cloudera Manager may need more heap to
operate. This can be configured in /etc/default/cloudera-scm-server.
You could, for instance, change "-Xmx2G" to "-Xmx3G" or "-Xmx4G".
If the problem still happens, perhaps the heap dumps will yield some clues.
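On most package installs the heap flag lives in the CMF_JAVA_OPTS line of that file (variable name and extra flags can differ by version); a sketch of the change, keeping whatever other flags your install ships with:
# /etc/default/cloudera-scm-server
export CMF_JAVA_OPTS="-Xmx4G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
sudo service cloudera-scm-server restart
Bear in mind that on a free-tier instance with around 1 GB of RAM no heap setting will fit; a larger instance is the realistic fix there.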
I'd suggest you tail the logs. If you are using the free tier, cloudera manager will take a while to come up... possibly up to 5 minutes or more after you start the cloudera-scm-server.
The logs should show if there are any errors, possibly issues with memory allocation since the free tier servers have limited memory available. The little snippet of log entries looks fine and typical - it will go through a long list of processes before the UI comes up on 7180.
Also, while that is going on, run top or even free -g to see how much of the system's resources are being used - particularly memory.
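For example, in a second shell while the service is starting:
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
free -g    # or: top
If free shows almost nothing available, the memory problem described in the other answer is the likely culprit.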
I was having the exact same issue, cannot hit the CM login using public DNS or IP on port 7180.
The following steps will help you:
iptables stopped (service iptables stop)
SELinux disabled (go to /etc/selinux/config and disable SELinux)
curl/wget localhost:7180 works (check the curl status)
ufw allow 7180
service httpd status should be running.
check the /var/log/cloudera-scm-server log: if any error is found, troubleshoot it
cloudera-scm-server status (should be running state)
check what netstat -nap | grep 7180 returns (if another service is running on that port, kill it)
telnet localhost 7180 (should be connected)
Cannot connect to Cloudera Manager, not listening on port 7180
1] Check the status:
sudo service cloudera-scm-server status
● cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server; bad; vendor preset: disabled)
Active: active (exited) since UTC; 47min ago
Docs: man:systemd-sysv-generator(8)
Remove the stale pid file:
rm /var/run/cloudera-scm-server.pid
NOTE : The Cloudera Manager service will not be running as it exited abnormally.
Running service cloudera-scm-server status will print the following message: "cloudera-scm-server dead but pid file exists".
Reason: Out of memory.
Solution : Examine the heap dump that the Cloudera Manager Server creates when it runs out of memory. The heap dump file is created in the /tmp directory, has file extension .hprof and file permission of 600. Its owner and group will be the owner and group of the Cloudera Manager server process, normally cloudera-scm:cloudera-scm.
Link : http://www.cloudera.com/documentation/manager/5-0-x/Cloudera-Manager-Diagnostics-Guide/cm5dg_troubleshooting_cluster_config.html
Check the status of `cloudera-scm-server` and follow the instructions ahead:
[root@quickstart ~]# `service cloudera-scm-server status`
By default, Cloudera's QuickStart VM manages CDH using Linux's configuration
and service management. To use Cloudera Manager instead, you must shut down
and disable the existing CDH services and then start Cloudera Manager. You can
do this by running the following command:
`sudo /home/cloudera/cloudera-manager`
[root@quickstart ~]# `sudo /home/cloudera/cloudera-manager`
[QuickStart] Shutting down CDH services via init scripts...
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
[QuickStart] Disabling CDH services on boot...
[QuickStart] Starting Cloudera Manager services...
[QuickStart] Deploying client configuration...
[QuickStart] Starting CM Management services...
[QuickStart] Enabling CM services on boot...
[QuickStart] Starting CDH services...
________________________________________________________________________________
Success! You can now log into Cloudera Manager from the QuickStart VM's browser:
http://quickstart.cloudera:7180
Username: cloudera
Password: cloudera

Change IP address of a Hadoop HDFS data node server and avoid Block pool errors

I'm using the cloudera distribution of Hadoop and recently had to change the IP addresses of a few nodes in the cluster. After the change, on one of the nodes (Old IP:10.88.76.223, New IP: 10.88.69.31) the following error comes up when I try to start the data node service.
Initialization failed for block pool Block pool BP-77624948-10.88.65.174-13492342342 (storage id DS-820323624-10.88.76.223-50010-142302323234) service to hadoop-name-node-01/10.88.65.174:6666
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException): Datanode denied communication with namenode: DatanodeRegistration(10.88.69.31, storageID=DS-820323624-10.88.76.223-50010-142302323234, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster25;nsid=1486084428;c=0)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:656)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:3593)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:899)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:91)
Has anyone had success with changing the IP address of a hadoop data node and join it back to the cluster without data loss?
CHANGE HOST IP IN CLOUDERA MANAGER
Change the host IP on all nodes
sudo nano /etc/hosts
Edit the IP in the Cloudera config.ini on all nodes if the master node IP changes
sudo nano /etc/cloudera-scm-agent/config.ini
Change the IP in the PostgreSQL database
For the password, open the PostgreSQL properties file
cat /etc/cloudera-scm-server/db.properties
Find the password line
Example: com.cloudera.cmf.db.password=gUHHwvJdoE
Open PostgreSQL
psql -h localhost -p 7432 -U scm
Query the hosts table in PostgreSQL
select name,host_id,ip_address from hosts;
Update the IP in the hosts table
update hosts set ip_address = 'xxx.xxx.xxx.xxx' where host_id=x;
Exit the tool
\q
Restart the agent service on all nodes
service cloudera-scm-agent restart
Restart the server service on the master node
service cloudera-scm-server restart
Turns out it's better to:
Decommission the server from the cluster to ensure that all blocks are replicated to other nodes in the cluster.
Remove the server from the cluster
Connect to the server and change the IP address then restart the cloudera agent
Notice that cloudera manager now shows two entries for this server. Delete the entry with the old IP and longest heartbeat time
Add the server to the required cluster and add required roles back to the server (e.g. HDFS datanode, HBASE RS, Yarn)
HDFS will read all data disks and recognize the block pool and cluster IDs, then register the datanode.
All data will be available and the process will be transparent to any client.
NOTE: If you run into name resolution errors from HDFS clients, the application has likely cached the old IP and will most likely need to be restarted. In particular, Java clients that previously referenced this server (e.g. HBase clients) must be restarted, because the JVM can cache IPs indefinitely; they will keep throwing connectivity errors against the old IP until they are restarted.
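If restarting every client is hard to schedule, the JVM's positive DNS cache can also be bounded ahead of time via the networkaddress.cache.ttl security property (the 60-second value below is illustrative, not from the original answer):
# $JAVA_HOME/jre/lib/security/java.security
networkaddress.cache.ttl=60
The legacy sun.net.inetaddr.ttl system property is the equivalent knob on older JVMs.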
