Filebeat failed to connect to Elasticsearch

I have Elasticsearch running on Kubernetes (EKS), with Filebeat running as a DaemonSet on Kubernetes.
Now I am trying to get logs from other EC2 machines (outside of EKS), so I have installed the exact same version of Filebeat on the EC2 instances and configured it to send logs to the Elasticsearch running on Kubernetes.
But I am not able to see any logs in Elasticsearch (Kibana). Here are the Filebeat logs:
2019-08-26T18:18:16.005Z INFO instance/beat.go:292 Setup Beat: filebeat; Version: 7.2.1
2019-08-26T18:18:16.005Z INFO [index-management] idxmgmt/std.go:178 Set output.elasticsearch.index to 'filebeat-7.2.1' as ILM is enabled.
2019-08-26T18:18:16.005Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
2019-08-26T18:18:16.005Z INFO add_cloud_metadata/add_cloud_metadata.go:351 add_cloud_metadata: hosting provider type detected as aws, metadata={"availability_zone":"us-west-2a","instance":{"id":"i-0185e1d68306f95b4"},"machine":{"type":"t2.medium"},"provider":"aws","region":"us-west-2"}
2019-08-26T18:18:16.005Z INFO [publisher] pipeline/module.go:97 Beat name: dev-web1
2019-08-26T18:18:16.006Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
There is not much info in the logs.
Then I noticed:
root@dev-web1:~# sudo systemctl status filebeat
● filebeat.service - Filebeat sends log files to Logstash or directly to Elasticsearch.
Loaded: loaded (/lib/systemd/system/filebeat.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-08-26 18:18:47 UTC; 18min ago
Docs: https://www.elastic.co/products/beats/filebeat
Main PID: 7768 (filebeat)
CGroup: /system.slice/filebeat.service
└─7768 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(http://elasticsear
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(http://elastic
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO [publisher] pipeline/retry.go:189 retryer: send unwait-signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:191 done
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:166 retryer: send wait signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:168 done
Aug 26 18:35:47 dev-web1 filebeat[7768]: 2019-08-26T18:35:47.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
Aug 26 18:36:17 dev-web1 filebeat[7768]: 2019-08-26T18:36:17.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
root@dev-web1:~#
But I can't read the complete lines in the status output above.
So I tried:
root@dev-web1:~# curl elasticsearch.dev.domain.net/_cat/health
1566844775 18:39:35 dev-eks-logs green 3 3 48 24 0 0 0 0 - 100.0%
root@dev-web1:~#
which worked, but it did not work with the port:
root@dev-web1:~# curl elasticsearch.dev.domain.net:9200/_cat/health
Filebeat has the following config:
output.elasticsearch:
  hosts: ["elasticsearch.dev.domain.net"]
  username: "elastic"
  password: "changeme"
How can I fix this on the Filebeat side?
Telnet test:
root@dev-web1:~# telnet <ip> 5044
Trying <ip>...
telnet: Unable to connect to remote host: Connection refused
root@dev-web1:~# telnet localhost 5044
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
root@dev-web1:~#

https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#hosts-option says:
hosts ... If no port is specified, 9200 is used.
Since curl works against the host without a port (i.e. over port 80) but not against port 9200, Elasticsearch appears to be exposed through a proxy/load balancer on port 80, while Filebeat is defaulting to 9200. Adding hosts: ["elasticsearch.dev.domain.net:80"] to the Filebeat configuration should resolve the issue.
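For reference, a minimal output.elasticsearch block with the port spelled out could look like the sketch below (based on the config shown in the question; adjust the port to wherever the cluster is actually exposed, e.g. 80 behind a load balancer):
output.elasticsearch:
  # Filebeat defaults to port 9200 when none is given, so state it explicitly
  hosts: ["elasticsearch.dev.domain.net:80"]
  username: "elastic"
  password: "changeme"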

I think this is a network problem; check with a telnet to localhost/the IP on port 5044.
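A couple of quick checks from the EC2 host can also narrow this down (a sketch; note that 5044 is the default Beats-to-Logstash port, so for an output.elasticsearch setup the port that matters is the one Elasticsearch is actually exposed on, e.g. 80 or 9200):
# Does the HTTP endpoint answer on the default port 80 (matches the curl above)?
curl -s http://elasticsearch.dev.domain.net/_cat/health
# Is anything reachable on 9200? (-z: just probe the port, -v: verbose)
nc -vz elasticsearch.dev.domain.net 9200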

Related

HDFS NFS startup error: “ERROR mount.MountdBase: Failed to start the TCP server...ChannelException: Failed to bind..."

I am attempting to use / start up HDFS NFS following the docs (ignoring the instructions to stop the rpcbind service and not starting the hadoop portmap service, given that the OS is not SLES 11 or RHEL 6.2), but I am running into an error when trying to set up the NFS service by starting the hdfs nfs3 service:
[root@HW02 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@HW02 ~]# service nfs status
Redirecting to /bin/systemctl status nfs.service
Unit nfs.service could not be found.
[root@HW02 ~]# service nfs stop
Redirecting to /bin/systemctl stop nfs.service
Failed to stop nfs.service: Unit nfs.service not loaded.
[root@HW02 ~]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-07-23 13:48:54 HST; 28s ago
Process: 27337 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
Main PID: 27338 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─27338 /sbin/rpcbind -w
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Starting RPC bind service...
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Started RPC bind service.
[root@HW02 ~]# hdfs nfs3
19/07/23 13:49:33 INFO nfs3.Nfs3Base: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting Nfs3
STARTUP_MSG: host = HW02.ucera.local/172.18.4.47
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.1.3.1.0.0-78
STARTUP_MSG: classpath = /usr/hdp/3.1.0.0-78/hadoop/conf:/usr/hdp/3.1.0.0-78/hadoop/lib/jersey-server-1.19.jar:/usr/hdp/3.1.0.0-78/hadoop/lib/ranger-hdfs-plugin-shim-1.2.0.3.1.0.0-78.jar:
...
<a bunch of other jars>
...
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r e4f82af51faec922b4804d0232a637422ec29e64; compiled by 'jenkins' on 2018-12-06T12:26Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
19/07/23 13:49:33 INFO nfs3.Nfs3Base: registered UNIX signal handlers for [TERM, HUP, INT]
19/07/23 13:49:33 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Nfs3 metrics system started
19/07/23 13:49:33 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:33 INFO security.ShellBasedIdMapping: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
19/07/23 13:49:33 INFO nfs3.WriteManager: Stream timeout is 600000ms.
19/07/23 13:49:33 INFO nfs3.WriteManager: Maximum open streams is 256
19/07/23 13:49:33 INFO nfs3.OpenFileCtxCache: Maximum open streams is 256
19/07/23 13:49:34 INFO nfs3.DFSClientCache: Added export: / FileSystem URI: / with namenodeId: -1408097406
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Configured HDFS superuser is
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Delete current dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Create new dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.Nfs3Base: NFS server port set to: 2049
19/07/23 13:49:34 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:34 INFO mount.RpcProgramMountd: FS:hdfs adding export Path:/ with URI: hdfs://hw01.ucera.local:8020/
19/07/23 13:49:34 INFO oncrpc.SimpleUdpServer: Started listening to UDP requests at port 4242 for Rpc program: mountd at localhost:4242 with workerCount 1
19/07/23 13:49:34 ERROR mount.MountdBase: Failed to start the TCP server.
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.hadoop.oncrpc.SimpleTcpServer.run(SimpleTcpServer.java:89)
at org.apache.hadoop.mount.MountdBase.startTCPServer(MountdBase.java:83)
at org.apache.hadoop.mount.MountdBase.start(MountdBase.java:98)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startServiceInternal(Nfs3.java:56)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:69)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:79)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
...
...
19/07/23 13:49:34 INFO util.ExitUtil: Exiting with status 1: org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
19/07/23 13:49:34 INFO nfs3.Nfs3Base: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down Nfs3 at HW02.ucera.local/172.18.4.47
************************************************************/
I am not sure how to interpret any of the errors seen here (and I have not installed any packages like nfs-utils, assuming that Ambari would have installed all the needed packages when the cluster was initially installed).
Any debugging suggestions or solutions for what to do about this?
** UPDATE:
After looking at the error, I can see
Caused by: java.net.BindException: Address already in use
and looking into what is already using it, we see...
[root@HW02 ~]# netstat -ltnp | grep 4242
tcp 0 0 0.0.0.0:4242 0.0.0.0:* LISTEN 98067/jsvc.exec
The process jsvc.exec appears to be related to running Java applications. Given that Hadoop runs on Java, I assume it would be bad to just kill the process. Is it not supposed to be on this port (since it interferes with the NFS Gateway)? I'm not sure what to do about this.
TLDR: the NFS gateway service was already running (by default, apparently), and the process that I thought was blocking the hadoop nfs3 service from starting (jsvc.exec) was, I'm assuming, part of that already-running service.
What made me suspect this was that the service also stopped when shutting down the cluster, plus the fact that it was using the port I needed for NFS. I confirmed it by following the verification steps in the docs and seeing that my output was similar to what should be expected.
[root@HW02 ~]# rpcinfo -p hw02
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 4242 mountd
100005 2 udp 4242 mountd
100005 3 udp 4242 mountd
100005 1 tcp 4242 mountd
100005 2 tcp 4242 mountd
100005 3 tcp 4242 mountd
100003 3 tcp 2049 nfs
[root@HW02 ~]# showmount -e hw02
Export list for hw02:
/ *
Another thing that could have told me that the jsvc process was part of an already-running HDFS NFS service would have been checking the process info...
[root@HW02 ~]# ps -feww | grep jsvc
root 61106 59083 0 14:27 pts/2 00:00:00 grep --color=auto jsvc
root 163179 1 0 12:14 ? 00:00:00 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
...
hdfs 163193 163179 0 12:14 ? 00:00:17 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
and seeing jsvc.exec -Dproc_nfs3 ... to get the hint that jsvc (which apparently is for running Java apps on Linux) was being used to run the very nfs3 service I was trying to start.
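Putting those checks together, a quick way to see whether an HDFS NFS gateway is already holding the ports before launching hdfs nfs3 by hand might look like this (a sketch using the same tools shown above):
# Is anything already bound to the mountd (4242) or nfs (2049) ports?
netstat -ltnp | grep -E ':4242|:2049'
# Are mountd/nfs programs already registered with rpcbind?
rpcinfo -p localhost | grep -E 'mountd|nfs'
# Is a jsvc-wrapped nfs3 process already running?
ps -ef | grep '[j]svc.exec -Dproc_nfs3'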
And for anyone else with this problem, note that I did not stop all the services that the docs want you to stop (since I am using CentOS 7):
[root@HW01 /]# service nfs status
Redirecting to /bin/systemctl status nfs.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root@HW01 /]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2019-07-19 15:17:02 HST; 6 days ago
Main PID: 2155 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─2155 /sbin/rpcbind -w
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Also note that I did not follow any of the config file settings recommended in the docs, and some of the properties instructed in the docs could not even be found in the Ambari-managed HDFS configs (so if anyone can explain why this is still working for me despite that, please do).
** Update:
After talking with some people more experienced with HDP (v3.1) than me, it seems the docs I linked to for setting up NFS for HDFS may not be totally up to date (when setting up NFS via Ambari management, in any case)...
You can have a cluster node act as an NFS gateway by checking it off as an NFS Gateway node in the Ambari host management UI (screenshot omitted).
The needed configs can be set in the HDFS management UI (screenshot omitted).
You can confirm that the HDFS NFS gateway is running by looking at the Host > Summary > Components section in Ambari (screenshot omitted).
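If it helps, the gateway component's state can also be queried through the Ambari REST API (a sketch; the NFS_GATEWAY component name, cluster name and admin credentials below are assumptions, adjust them to your environment):
# List hosts running the NFS Gateway component and their current state
curl -s -u admin:admin \
  'http://<ambari-host>:8080/api/v1/clusters/<cluster>/host_components?HostRoles/component_name=NFS_GATEWAY&fields=HostRoles/state'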

Issue trying to get Kibana talking to Elasticsearch

I have Kibana and Elasticsearch on separate VMs.
Elasticsearch IP: 10.3.30.209
Kibana IP: 10.3.30.211
I am unable to curl the IP:PORT of Elasticsearch from Kibana:
curl 10.3.30.209:9200
curl: (7) Failed connect to 10.3.30.209:9200; Connection refused
The firewall on the Elasticsearch server has port 9200 open:
firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: ens192
sources:
services: ssh dhcpv6-client
ports: 9200/tcp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
The Elasticsearch service was at least starting up before I modified the elasticsearch.yml file to enable listening on the elasticsearch port and bind the Kibana IP:
network.host: 10.3.30.211
# Set a custom port for HTTP:
#
# http.port: 9200
http.port: 9200
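One thing worth double-checking here: network.host must be an address the Elasticsearch node itself can bind locally, so pointing it at the Kibana VM's IP will typically make startup fail with a bind exception. A minimal sketch of the binding, assuming 10.3.30.209 is the Elasticsearch VM's own address:
# Bind to this node's own address (or 0.0.0.0 to listen on all interfaces)
network.host: 10.3.30.209
http.port: 9200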
Error when starting service:
systemctl status elasticsearch
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2018-06-29 14:06:49 CDT; 1s ago
Docs: http://www.elastic.co
Process: 6382 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -Des.pidfile=${PID_DIR}/elasticsearch.pid -Des.default.path.home=${ES_HOME} -Des.default.path.logs=${LOG_DIR} -Des.default.path.data=${DATA_DIR} -Des.default.path.conf=${CONF_DIR} (code=exited, status=1/FAILURE)
Process: 6380 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
Main PID: 6382 (code=exited, status=1/FAILURE)
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: at java.lang.Thread.run(Thread.java:748)
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: Refer to the log for complete error details.
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: [2018-06-29 14:06:49,189][INFO ][node ] [Turac] stopping ...
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: [2018-06-29 14:06:49,199][INFO ][node ] [Turac] stopped
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: [2018-06-29 14:06:49,200][INFO ][node ] [Turac] closing ...
Jun 29 14:06:49 elasticsearch01 elasticsearch[6382]: [2018-06-29 14:06:49,204][INFO ][node ] [Turac] closed
Jun 29 14:06:49 elasticsearch01 systemd[1]: elasticsearch.service: main process exited, code=exited, status=1/FAILURE
Jun 29 14:06:49 elasticsearch01 systemd[1]: Unit elasticsearch.service entered failed state.
Jun 29 14:06:49 elasticsearch01 systemd[1]: elasticsearch.service failed.
NOTE: I am able to ping the Elasticsearch server from Kibana.
Any help with figuring out why this change breaks the Elasticsearch service and how I might be able to connect my Kibana server to Elasticsearch would be appreciated!

Filebeat connect Logstash always i/o timeout

Filebeat worked well before I changed the password of Elasticsearch. By the way, I use docker-compose to start the service; here is some information about my Filebeat setup.
Console log:
filebeat | 2017/05/11 05:21:33.020851 beat.go:285: INFO Home path: [/] Config path: [/] Data path: [//data] Logs path: [//logs]
filebeat | 2017/05/11 05:21:33.020903 beat.go:186: INFO Setup Beat: filebeat; Version: 5.3.0
filebeat | 2017/05/11 05:21:33.021019 logstash.go:90: INFO Max Retries set to: 3
filebeat | 2017/05/11 05:21:33.021097 outputs.go:108: INFO Activated logstash as output plugin.
filebeat | 2017/05/11 05:21:33.021908 publish.go:295: INFO Publisher name: fd2f326e51d9
filebeat | 2017/05/11 05:21:33.022092 async.go:63: INFO Flush Interval set to: 1s
filebeat | 2017/05/11 05:21:33.022104 async.go:64: INFO Max Bulk Size set to: 2048
filebeat | 2017/05/11 05:21:33.022220 modules.go:93: ERR Not loading modules. Module directory not found: /module
filebeat | 2017/05/11 05:21:33.022291 beat.go:221: INFO filebeat start running.
filebeat | 2017/05/11 05:21:33.022334 registrar.go:68: INFO No registry file found under: /data/registry. Creating a new registry file.
filebeat | 2017/05/11 05:21:33.022570 metrics.go:23: INFO Metrics logging every 30s
filebeat | 2017/05/11 05:21:33.025878 registrar.go:106: INFO Loading registrar data from /data/registry
filebeat | 2017/05/11 05:21:33.025918 registrar.go:123: INFO States Loaded from registrar: 0
filebeat | 2017/05/11 05:21:33.025970 crawler.go:38: INFO Loading Prospectors: 1
filebeat | 2017/05/11 05:21:33.026119 prospector_log.go:61: INFO Prospector with previous states loaded: 0
filebeat | 2017/05/11 05:21:33.026278 prospector.go:124: INFO Starting prospector of type: log; id: 5816422928785612348
filebeat | 2017/05/11 05:21:33.026299 crawler.go:58: INFO Loading and starting Prospectors completed. Enabled prospectors: 1
filebeat | 2017/05/11 05:21:33.026323 registrar.go:236: INFO Starting Registrar
filebeat | 2017/05/11 05:21:33.026364 sync.go:41: INFO Start sending events to output
filebeat | 2017/05/11 05:21:33.026394 spooler.go:63: INFO Starting spooler: spool_size: 2048; idle_timeout: 5s
filebeat | 2017/05/11 05:21:33.026731 log.go:91: INFO Harvester started for file: /data/logs/biz.log
filebeat | 2017/05/11 05:22:03.023313 metrics.go:39: INFO Non-zero metrics in the last 30s: filebeat.harvester.open_files=1 filebeat.harvester.running=1 filebeat.harvester.started=1 libbeat.publisher.published_events=98 registrar.writes=1
filebeat | 2017/05/11 05:22:08.028292 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:22:33.023370 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:22:39.028840 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:23:03.022906 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:23:11.029517 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:23:33.023450 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:23:45.030202 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:24:03.022864 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:24:23.030749 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:24:33.024029 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:25:03.023338 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:25:09.031348 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:25:33.023976 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:26:03.022900 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat | 2017/05/11 05:26:11.032346 single.go:140: ERR Connecting error publishing events (retrying): dial tcp 47.93.121.126:5044: i/o timeout
filebeat | 2017/05/11 05:26:33.022870 metrics.go:34: INFO No non-zero metrics in the last 30s
filebeat.yml:
filebeat:
  prospectors:
    -
      paths:
        - /data/logs/*.log
      input_type: log
      document_type: biz-log
  registry_file: /etc/registry/mark
output:
  logstash:
    enabled: true
    hosts: ["logstash:5044"]
docker-compose.yml:
version: '2'
services:
  filebeat:
    build: ./
    container_name: filebeat
    restart: always
    network_mode: "bridge"
    extra_hosts:
      - "logstash:47.93.121.126"
    volumes:
      - ./conf/filebeat.yml:/filebeat.yml
      - /mnt/logs/appserver/app/biz:/data/logs
      - ./registry:/data
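Given the i/o timeouts above, a quick reachability check toward the Logstash endpoint is worth doing from the Docker host and from inside the container (a sketch; it assumes nc is available in both places):
# From the Docker host: can we open TCP 5044 on the Logstash server?
nc -vz 47.93.121.126 5044
# From inside the Filebeat container, using the extra_hosts alias:
docker exec filebeat sh -c 'nc -vz logstash 5044'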
Having had a similar issue, I eventually realised the culprit was not Filebeat but Logstash.
Logstash's SSL configuration didn't contain all required attributes. Setting it up using the following declaration solved the issue:
input {
  beats {
    port => "{{ logstash_port }}"
    ssl => true
    ssl_certificate_authorities => [ "{{ tls_certificate_authority_file }}" ]
    ssl_certificate => "{{ tls_certificate_file }}"
    ssl_key => "{{ tls_certificate_key_file }}"
    ssl_verify_mode => "force_peer"
  }
}
The above example uses Ansible templating; remember to replace the placeholders between {{ and }} with the correct values.
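Note that with ssl_verify_mode => "force_peer" Logstash will also reject Beats clients that do not present a certificate, so the Filebeat side needs matching TLS settings. A sketch of the corresponding Filebeat 5.x output section (the certificate paths are placeholders):
output:
  logstash:
    hosts: ["logstash:5044"]
    ssl:
      # CA that signed the Logstash certificate
      certificate_authorities: ["/etc/pki/tls/certs/ca.crt"]
      # Client certificate and key, required because of force_peer
      certificate: "/etc/pki/tls/certs/filebeat.crt"
      key: "/etc/pki/tls/private/filebeat.key"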
The registry file stores the state and location information that Filebeat uses to track where it was last reading.
So you can try moving or deleting the registry file:
cd /var/lib/filebeat
sudo mv registry registry.bak
sudo service filebeat restart

Unable to access Cloudera Manager 5 web console after installation

I am setting up a Hadoop cluster (2.6) on CentOS 7 machines with three nodes, and the cluster is running fine now. However, I am not able to access the Cloudera Manager (5.6) web console after completing the CM installation, even though its services seem to be running.
Below are my findings; please help me figure out the possible reasons:
All processes are up and running!
[root@vm-txxxxxx1 ~]# jps
27978 ResourceManager
15368 Main
27052 Jps
27400 DataNode
27639 SecondaryNameNode
28106 NodeManager
27258 NameNode
Firewall stopped
[root@vm-txxxxx1 ~]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@vm-txxxxxx1 ~]# service iptabes status
Redirecting to /bin/systemctl status iptabes.service
iptabes.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)
Mar 24 19:24:05 vm-txxxxx1 systemd[1]: Stopped IPv4 firewall with iptables.
It is listening on port 7180, and I tested the same locally:
[root@vm-txxxxxx1 ~]# netstat -tulpn | grep 7180
tcp 0 0 0.0.0.0:7180 0.0.0.0:* LISTEN 15368/java
[root@vm-txxxxx1 ~]# telnet localhost 7180
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
SELinux disabled:
[root@vm-txxxxxx1 ~]# getenforce
Disabled
Host file entries:
[root@vm-txxxxxx1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4
172.16.xx.x1 vm-txxxxxx1
172.16.xx.x2 vm-xxxxxxx2
172.16.xx.x4 del1-vm-poc04
Verify if Cloudera Manager is running:
[root@vm-txxxxxx1 ~]# service cloudera-scm-server status
cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server)
Active: active (exited) since Tue 2016-03-22 17:09:55 IST; 2 days ago
Process: 15344 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
Mar 22 17:09:50 vm-txxxxxx1 systemd[1]: Starting LSB: Cloudera SCM Server...
Mar 22 17:09:50 vm-txxxxx1 su[15366]: (to cloudera-scm) root on none
Mar 22 17:09:55 vm-txxxxxx1 cloudera-scm-server[15344]: Starting cloudera-scm-server:...]
Mar 22 17:09:55 vm-txxxxxx1 systemd[1]: Started LSB: Cloudera SCM Server.
Hint: Some lines were ellipsized, use -l to show in full.
Below are the last lines from the Cloudera server logs:
[root@vm-txxxxx1 ~]# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-03-24 18:21:00,398 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Reaped total of 0 deleted commands
2016-03-24 18:21:00,400 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Found no commands older than 2014-03-25T12:51:00.399Z to reap.
2016-03-24 18:21:00,400 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Wizard is active, not reaping scanners or configurators
I am accessing the Cloudera Manager page at http://172.16.xx.1x:7180.
In the end it says "The connection has timed out", so it looks like my HTTP request is not able to reach the server, which is why nothing shows up in the logs. Please suggest if I am missing something.
Thanks in advance!
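Two checks that usually narrow this down (a sketch; run the first from the machine whose browser is timing out and the rest on the CM host; note that on CentOS 7 the default firewall is firewalld, so stopping iptables.service alone may not open the port):
# From the client: does anything answer on 7180?
curl -v --connect-timeout 5 http://172.16.xx.1x:7180/
# On the CM host: is firewalld running, and is 7180 allowed through it?
systemctl status firewalld
firewall-cmd --list-ports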
@Havnar: Thanks for the suggestion. I can confirm SSL is not enabled, and I am sharing the curl result:
[root@vm-txxxx1 ~]# curl -i -u 'admin:admin' http://localhost:7180/api/v1/tools/echo
HTTP/1.1 200 OK
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: CLOUDERA_MANAGER_SESSIONID=1etaj5o42vprlndf43ua7rbaf;Path=/;HttpOnly
Content-Type: application/json
Date: Fri, 25 Mar 2016 05:50:36 GMT
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{
"message" : "Hello, World!"
I tried stopping and restarting the Cloudera service and found nothing suspicious; there was one warning which looks a little bit suspicious, but searching Google for it turned up nothing relevant.
[root@vm-txxxxx1 ~]# vi /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-03-24 20:22:29,002 WARN main:org.hibernate.cache.ehcache.AbstractEhcacheRegionFactory: HHH020003: Could not find a specific ehcache configuration for cache named [org.hibernate.cache.internal.StandardQueryCache]; using defaults.
2016-03-24 20:22:28,581 INFO main:org.hibernate.engine.jdbc.internal.LobCreatorBuilder: HHH000424: Disabling contextual LOB creation as createClob() method threw error : java.lang.reflect.InvocationTargetException
@Havnar: I didn't get what you meant by "try a cat on the machine running the CM"; let me know if anything else needs to be checked.
Thanks

One node in hadoop cluster failure

I recently configured a 10-node HDP Hadoop cluster; each node runs SLES 11.
On the master node I have configured all the master services and clients, and also the ambari-server. The remaining nodes run the other slave services and their clients.
NTP sync is on, and the other prerequisites are also fine.
I am experiencing weird behavior on the Hadoop cluster: after starting all the services, one of the nodes goes down within a few hours.
When I experienced this the first time, I restarted that particular node and added it back to the cluster.
Now my master node is causing the same issue, due to which the whole cluster is down. I have checked the logs but there are no indications related to the failure.
I am clueless as to the root cause of the node failures in the Hadoop cluster.
Below are the logs.
From the system which went down, /var/log/messages:
Apr 23 05:22:43 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:23:49 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:24:17 lnx1863 sudo: root : TTY=pts/0 ; PWD=/ ; USER=root ; COMMAND=/usr/bin/du -h /
Apr 23 05:24:55 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:25:22 lnx1863 kernel: [248531.127254] megasas: Found FW in FAULT state, will reset adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127260] megaraid_sas: resetting fusion adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127427] megaraid_sas: Reset not supported, killing adapter.
Namenode logs:
INFO 2015-04-23 05:27:43,665 Heartbeat.py:78 - Building Heartbeat: {responseId = 7607, timestamp = 1429781263665, commandsInProgress = False, componentsMapped = True}
INFO 2015-04-23 05:28:44,053 security.py:135 - Encountered communication error. Details: SSLError('The read operation timed out',)
ERROR 2015-04-23 05:28:44,053 Controller.py:278 - Connection to http://localhost was lost (details=Request to https://localhost:8441/agent/v1/heartbeat/localhostip failed due to Error occured during connecting to the server: The read operation timed out)
INFO 2015-04-23 05:29:16,061 NetUtil.py:48 - Connecting to https://localhost:8440/connection_info
INFO 2015-04-23 05:29:16,118 security.py:93 - SSL Connect being called.. connecting to the server
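For what it's worth, the kernel lines above (megasas: Found FW in FAULT state ... Reset not supported, killing adapter) point at the RAID controller on that node rather than at Hadoop itself. A quick way to gather those messages for a hardware check (a sketch using standard tools):
# Pull storage-controller errors from the kernel ring buffer and syslog
dmesg | grep -iE 'megaraid|megasas'
grep -iE 'megaraid|megasas' /var/log/messages | tail -n 50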
