Logstash cluster output to Elasticsearch cluster without multicast

I want to run logstash -> elasticsearch with high availability and cannot find an easy way to achieve it. Please review how I see it and correct me:
Goal:
5 machines each running elasticsearch united into a single cluster.
5 machines each running logstash server and streaming data into elasticsearch cluster.
N machines under monitoring each running lumberjack and streaming data into logstash servers.
Constraint:
It is supposed to run on a PaaS (CoreOS/Docker), so multicast discovery does not work.
Solution:
Lumberjack allows you to specify a list of logstash servers to forward data to. Lumberjack randomly selects a target server and switches to another one if that server goes down. It works.
I can use the zookeeper discovery plugin to construct the elasticsearch cluster. It works.
With multicast, each logstash server discovers and joins the elasticsearch cluster. Without multicast, I can only point logstash at a single elasticsearch host, which is not highly available. I want to output to the cluster, not to a single host that can go down.
Question:
Is it realistic to add a zookeeper discovery plugin to logstash's embedded elasticsearch? How?
Is there an easier (natural) solution for this problem?
Thanks!

You could potentially run a separate (non-embedded) Elasticsearch instance within the Logstash container, but configure it not to store data, and perhaps make these instances the master nodes:
node.data: false
node.master: true
You could then add your Zookeeper plugin to all Elasticsearch instances so they form the cluster.
Logstash then logs over http to the local Elasticsearch, which works out which of the 5 data-storing nodes should actually index the data.
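For example, each Logstash would then just point at its local node (a minimal sketch; this uses the current elasticsearch output syntax, while older 1.x versions used host/protocol options instead of hosts):
output {
  elasticsearch {
    hosts => ["localhost:9200"]   # the local non-data node routes to the data nodes
  }
}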
Alternatively, this question explains how to get plugins working with the embedded version of Elasticsearch: Logstash output to Elasticsearch on AWS EC2

Related

How to configure filebeat for logstash cluster environment?

I am missing something very basic when I think of how Filebeat will be configured in a clustered logstash setup.
As per the article
https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
and its architecture diagram, I think that there is some kind of load balancer in front of the logstash cluster. However, the Filebeat output documentation suggests that an array of all the Logstash nodes must be specified; using this list of nodes, Filebeat does the load balancing from the client side.
Also, as per this GitHub issue, there is no native Logstash clustering available yet.
So, my question is: what kind of setup do I need to be able to point my multiple Filebeat instances to one Logstash service endpoint without specifying the individual Logstash nodes in the cluster?
Is it possible?
Would having a load balancer in front of the Logstash cluster be of any help?
Thanks,
Manish
Since the Logstash clustering feature is still in the works and you don't want to specify all the Logstash hosts in all your Beats configurations, the only solution I see is to use a TCP load balancer in front of Logstash.
All your Beats would point to that load balancer endpoint, and you can manage your Logstash cluster behind the load balancer as you see fit. Be aware, though, that you're adding a hop (and hence latency) between your Beats and your Logstash cluster.
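For illustration, the Beats side then only needs that one endpoint (a minimal filebeat.yml sketch; the load balancer hostname and port are placeholders):
output.logstash:
  hosts: ["logstash-lb.internal:5044"]   # the single TCP load balancer endpoint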

Logstash - is Pull possible?

We are trying to build up an ElasticSearch data collector. The ElasticSearch cluster should receive data from different servers. These servers are at other locations (and on other networks) than the ElasticSearch cluster. The clients are connected to the ElasticCluster via one-way VPN connections.
As a first attempt we installed Logstash on each client server to collect the data, filter it, and send it to the ElasticCluster. In a test environment this was no problem. The problem now is that the Logstash on each client tries to establish a connection to ElasticSearch, and this attempt is blocked by the firewall. It is, however, possible to open a connection from the ElasticCluster side to each client and receive the data. What we need is a way to get the data out of Logstash such that we open the connection and pull the data from Logstash (PULL). Is there a way to do this without changing the VPN configuration?
Logstash pushes events; if your logstash instances can't initiate the connection to the elasticsearch nodes, you will need something in the middle or have to allow the traffic on the firewall/VPN.
For example, you can have an elasticsearch node in the middle to which the logstash servers push data, and then another logstash in your main cluster environment with a pipeline whose input is that middle elasticsearch; this way the data is pulled from it.
edit:
As I've said in the comment, you need to have something like this image.
Here your servers send data to a logstash instance; this logstash has an output to an elasticsearch instance, so it initiates the connection, pushing the data.
On your main cluster side, where you have your elasticsearch cluster and a one-way VPN that can only initiate connections outward, you will have another logstash; this logstash has an input that queries the outside elasticsearch node, pulling the data.
In that logstash pipeline you can use an elasticsearch input, which queries an elasticsearch node and then sends the received data through your filters and outputs:
input {
  elasticsearch {
    hosts => ["middle-es:9200"]     # the elasticsearch in the middle (placeholder address)
    index => "logstash-*"           # placeholder index pattern
    query => '{ "query": { "match_all": {} } }'
    schedule => "* * * * *"         # poll every minute
  }
}
filter {
  # your filters
}
output {
  elasticsearch {
    hosts => ["es-node1:9200", "es-node2:9200"]   # your cluster nodes (placeholders)
  }
}
Is it clear now?

Where will ELK and Filebeat reside

I am working in a distributed environment. I have a central machine which needs to monitor some 100 machines.
So I need to use ELK stack and keep monitoring the data.
Since elasticsearch, logstash, kibana and filebeat are independent pieces of software, I want to know where I should ideally place them in my distributed environment.
My approach was to keep kibana and elasticsearch on the central node, and logstash and filebeat on the individual nodes.
Logstash will send data to the central node's elasticsearch, and kibana will display it.
Please let me know if this design is right.
Your design is not that bad, but if you install elasticsearch on only one server, you will eventually face availability problems.
You can do this:
Install filebeat and logstash on all the nodes.
Install elasticsearch as a cluster (see the sketch below). That way, if one elasticsearch node goes down, another node can easily take over.
Install Kibana on the central node.
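For the cluster point, a minimal elasticsearch.yml sketch (node names and addresses are placeholders; on Elasticsearch 7+ the unicast list is discovery.seed_hosts rather than discovery.zen.ping.unicast.hosts):
cluster.name: central-logs
node.name: es-node-1
discovery.zen.ping.unicast.hosts: ["es-node-1", "es-node-2", "es-node-3"]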
NB:
Make sure you configure filebeat to point to more than one logstash server (see the sketch after this note). By doing so, if one logstash fails, filebeat can still ship logs to another server.
Also make sure your logstash configuration points to all the data nodes (node.data) of your elasticsearch cluster.
You can also go further by installing kibana on, say, 3 nodes and attaching a load balancer to them. That way your load balancer will pick a healthy kibana instance to serve.
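For the filebeat point, the relevant filebeat.yml part could look like this (hostnames are placeholders):
output.logstash:
  hosts: ["logstash-a:5044", "logstash-b:5044"]
  loadbalance: true   # spread events across both and fail over if one goes down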
UPDATE
With elasticsearch configured, we can configure logstash as follows:
output {
  elasticsearch {
    hosts => ["http://123.456.789.1:9200","http://123.456.789.2:9200"]
    index => "indexname"
  }
}
You don't need to add stdout { codec => rubydebug } in your configuration.
Hope this helps.

What happens if logstash sends data to elasticsearch at a rate faster than it can index?

So I have multiple hosts with logstash installed on each host. Logstash on all these hosts reads from the log files generated by the host and sends data to my single aws elasticsearch cluster.
Now consider a scenario where large quantities of logs are being generated by each host at the same time. Since logstash is installed on each host and just forwards the data to the ES cluster, I assume that even if my elasticsearch cluster is not able to index it all, my hosts won't be affected. Are the logs just lost in such a scenario?
Can my host machines get affected in any way?
In short, you may lose some logs on the host machines, and that's why messaging solutions like Kafka are used: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html#deploying-message-queueing
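As a hedged sketch of that pattern (broker address and topic name are placeholders), each host's Logstash ships to Kafka, and a central Logstash consumes from Kafka and indexes into Elasticsearch at its own pace:
# on each host
output {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topic_id => "logs"
  }
}
# on a central indexer
input {
  kafka {
    bootstrap_servers => "kafka1:9092"
    topics => ["logs"]
  }
}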

Setting up an ELK cluster

I am trying to build a log pipe using RabbitMQ + ELK on Windows Servers.
RabbitMQ --> Logstash --> ElasticSearch --> Kibana.
Ideally I want to have 2 instances of RabbitMQ, 2 of Logstash, 3 of ElasticSearch and 1 of Kibana.
Has anyone set up something like this? I know we can set up an ElasticSearch cluster easily by setting the cluster name in the yml. What is the mechanism for logstash to write to the ES cluster?
Should I set up RabbitMQ+Logstash combos on each instance, so that with the MQs behind a load balancer, each MQ has its own logstash output instance and from there the data goes to the cluster?
Technically you could write directly from Logstash to ES using the elasticsearch output plugin, or the elasticsearch_http output plugin (if using an ES version not compatible with Logstash). That said, for an enterprise scenario where you need fault tolerance and have to handle volume, it's a good idea to have RabbitMQ/Redis in between.
Your above config looks good, although the input to your Rabbit cluster would come from one or many Logstash shippers (instances running on the client machines where the logs live) pointing to the HA RabbitMQ cluster. Then a Logstash indexer would have its input configured to read from the RabbitMQ queue(s) and its output pointed at the ElasticSearch cluster.
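A rough sketch of that shipper/indexer split using the rabbitmq plugins (host, exchange, and queue names are assumptions):
# shipper, on the client machines
output {
  rabbitmq {
    host => "rabbitmq.internal"
    exchange => "logs"
    exchange_type => "direct"
    key => "logstash"
  }
}
# indexer, next to the ES cluster
input {
  rabbitmq {
    host => "rabbitmq.internal"
    queue => "logstash"
    key => "logstash"
  }
}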
Hope that helps.
It's not recommended to send data directly from Logstash to ES.
ES writes are slow, so under heavy load you can lose data.
The idea is to add a proxy between Logstash and ES:
Logstash --> Proxy --> Elasticsearch
Logstash supports Redis and RabbitMQ as such a proxy.
The proxy can absorb large bursts of input and works as a queue mechanism.
Logstash puts Redis forward as the primary choice (because of its simplicity of setup and monitoring).
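As a sketch of the Redis variant (host and key are placeholders):
# shipper side
output {
  redis {
    host => ["redis.internal"]
    data_type => "list"
    key => "logstash"
  }
}
# indexer side
input {
  redis {
    host => "redis.internal"
    data_type => "list"
    key => "logstash"
  }
}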
