Logstash - is Pull possible?

We are trying to build up an Elasticsearch data collector. The Elasticsearch cluster should receive data from different servers that are in other locations (and networks) than the Elasticsearch cluster itself. The clients are connected to the Elasticsearch cluster via one-way VPN connections.
As a first attempt we installed Logstash on each client server to collect the data, filter it, and send it to the Elasticsearch cluster. This worked fine in a test environment. The problem is that the Logstash on the client tries to establish the connection to Elasticsearch, and this attempt is blocked by the firewall. It is, however, possible to open a connection from the Elasticsearch cluster side to each client and receive the data. What we need is a way to have the cluster side open the connection and pull the data from Logstash (PULL). Is there a way to do this without changing the VPN configuration?

Logstash pushes events. If your Logstash instances can't initiate the connection to the Elasticsearch nodes, you will need something in the middle, or you will have to allow the traffic on the firewall/VPN.
For example, you can have an Elasticsearch instance in the middle to which the Logstash servers push data, and then another Logstash in your main cluster environment with a pipeline whose input is that intermediate Elasticsearch; this way the data is pulled from the intermediate instance.
edit:
As I've said in the comment, you need an architecture like the following.
Here you have your servers sending data to a Logstash instance; this Logstash has an output to an Elasticsearch instance, so it initiates the connection and pushes the data.
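A minimal sketch of that client-side pipeline (hostnames, paths, and the index name are placeholders):
input {
  # whatever the client collects, e.g. local log files
  file {
    path => "/var/log/app/*.log"
  }
}
output {
  # push to the elasticsearch instance in the middle
  elasticsearch {
    hosts => ["middle-es:9200"]
    index => "staging-logs-%{+YYYY.MM.dd}"
  }
}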
On your main cluster side, where you have your Elasticsearch cluster and a one-way VPN that can only initiate connections from that side, you will have another Logstash; this Logstash will have an input that queries the outside Elasticsearch node, pulling the data.
In that Logstash pipeline you can use the elasticsearch input, which queries an Elasticsearch node and then sends the received data through your filters and outputs.
input {
  # pull: periodically query the elasticsearch instance in the middle
  elasticsearch {
    hosts => ["middle-es:9200"]
    index => "staging-logs-*"
    schedule => "* * * * *"  # without a schedule the query runs only once
  }
}
filter {
  # your filters
}
output {
  # push into your own cluster nodes
  elasticsearch {
    hosts => ["cluster-node-1:9200", "cluster-node-2:9200"]
  }
}
Is it clear now?

Related

Where will ELK and Filebeat reside?

I am working in a distributed environment. I have a central machine which needs to monitor some 100 machines.
So I need to use the ELK stack and keep monitoring the data.
Since Elasticsearch, Logstash, Kibana, and Filebeat are independent pieces of software, I want to know where I should ideally place them in my distributed environment.
My approach was to keep Kibana and Elasticsearch on the central node and keep Logstash and Filebeat on the individual nodes.
Logstash will send data to the central node's Elasticsearch, which Kibana displays.
Please let me know if this design is right.
Your design is not bad, but if you install Elasticsearch on only one server, you will sooner or later face an availability problem.
You can do this:
Install Filebeat and Logstash on all the nodes.
Install Elasticsearch as a cluster. That way, if one Elasticsearch node goes down, another can take over.
Install Kibana on the central node.
NB:
Make sure you configure Filebeat to point to more than one Logstash server. By doing so, if one Logstash instance fails, Filebeat can still ship logs to another server (see the sketch below).
Also make sure your Logstash configuration points to all the data nodes (node.data) of your Elasticsearch cluster.
You can also go further by installing Kibana on, say, 3 nodes and putting a load balancer in front of them. That way the load balancer will route to a healthy Kibana instance.
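As a sketch, the Filebeat output section for this could look as follows (hostnames are placeholders, and the exact keys depend on your Filebeat version):
output.logstash:
  # Filebeat balances between these hosts and fails over if one dies
  hosts: ["logstash-node-1:5044", "logstash-node-2:5044"]
  loadbalance: true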
UPDATE
With Elasticsearch configured, we can configure the Logstash output as follows:
output {
  elasticsearch {
    hosts => ["http://123.456.789.1:9200", "http://123.456.789.2:9200"]
    index => "indexname"
  }
}
You don't need to add stdout { codec => rubydebug } in your configuration.
Hope this helps.

What happens if logstash sends data to elasticsearch at a rate faster than it can index?

So I have multiple hosts with Logstash installed on each one. Logstash on all these hosts reads from the log files generated by the host and sends data to my single AWS Elasticsearch cluster.
Now consider a scenario where large quantities of logs are being generated by each host at the same time. Since Logstash is installed on each host and just forwards the data to the ES cluster, I assume that even if my Elasticsearch cluster is not able to index it, my hosts won't be affected. Are the logs just lost in such a scenario?
Can my host machines get affected in any way?
In short, Logstash applies backpressure when Elasticsearch can't keep up, but its internal queue is in memory by default, so you may lose some logs on the host machines (for example, if Logstash dies or log files rotate away before they are read). That's why messaging solutions like Kafka are put in the middle as a buffer: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html#deploying-message-queueing
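A rough sketch of that pattern (broker addresses and the topic name are placeholders): each host pushes to Kafka instead of Elasticsearch, and a separate consumer pipeline indexes at whatever rate the cluster can sustain.
# on each host: buffer events in kafka instead of sending to ES directly
output {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"
    topic_id => "logs"
  }
}
# on a consumer node: read back from kafka and index into elasticsearch
input {
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092"
    topics => ["logs"]
  }
}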

How to watch the logstash log?

For my enterprise application's distributed and structured logging, I use Logstash for log aggregation and Elasticsearch as log storage. I have clear control over pushing logs from my application to Logstash. On the other hand, I have very little control over the leg from Logstash to Elasticsearch.
Assume my Elasticsearch goes down for some stupid reason. The Logstash log (/var/log/logstash/logstash.log) records the reason clearly, like the following one.
Attempted to send a bulk request to Elasticsearch configured at '["http://localhost:9200/"]', but Elasticsearch appears to be unreachable or down! {:client_config=>{:hosts=>["http://localhost:9200/"], :ssl=>nil, :transport_options=>{:socket_timeout=>0, :request_timeout=>0, :proxy=>nil, :ssl=>{}}, :transport_class=>Elasticsearch::Transport::Transport::HTTP::Manticore, :logger=>nil, :tracer=>nil, :reload_connections=>false, :retry_on_failure=>false, :reload_on_failure=>false, :randomize_hosts=>false}, :error_message=>"Connection refused", :class=>"Manticore::SocketException", :level=>:error}
How will I get notified about error-level logs from Logstash?
Should be doable with the following 3 steps:
1) It depends on how you want to get notified. If an email is sufficient, you could use the Logstash email output plugin.
But there are many more output plugins available.
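A minimal sketch of that email output (the addresses and subject are placeholder values; mail server settings are omitted):
output {
  email {
    to => "ops@example.com"
    from => "logstash@example.com"
    subject => "Logstash reported an error"
    body => "%{message}"
  }
}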
2) To restrict it to certain events, you can do something like this in your Logstash config (the example is adapted from the Elastic support site):
output {
  # the conditional goes inside the output section
  if [level] == "ERROR" {
    ...
  }
}
The if clause is not limited to the level field of your JSON; you can of course apply it to any of your JSON fields, which makes it more powerful.
3) To make this work (and not run into a logging cycle) you need to either:
start a second Logstash instance on your system (just observing the Logstash ERROR log), which should be okay from what is written here,
or build a more complicated configuration using just one Logstash instance. This configuration has to forward log statements from YOUR application to Elasticsearch, while log statements from the Logstash ERROR log are forwarded to, e.g., the Logstash email output plugin.
Side note: you may want to have a look at Filebeat, which works very well with Logstash (it's from Elastic as well) and is even more lightweight than Logstash. It allows things like include_lines: ["^ERR", "^WARN"] in its configuration.
To receive input from Filebeat you will have to adapt the Filebeat config to send data to Logstash, and on the Logstash side you will have to activate and use the Beats input plugin described here (see the sketch below).
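A minimal sketch of both sides, assuming Filebeat tails the Logstash log (paths, host, and port are placeholders; the input section keys vary between Filebeat versions):
filebeat.inputs:
  - type: log
    paths:
      - /var/log/logstash/logstash.log
    include_lines: ["^ERR", "^WARN"]
output.logstash:
  hosts: ["monitoring-host:5044"]
And on the receiving Logstash, the Beats input:
input {
  beats {
    port => 5044
  }
}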

Multiple nodes in a single Elasticsearch server

I am seeing multiple nodes in a single Elasticsearch server, where I had specified there should be only one.
This server is used to parse Logstash logs.
Probably you have connected the Logstash instances with the transport client. As you can see, there is only one data node in the screenshot. This way the Logstash instances connect to the cluster as Elasticsearch nodes, but they don't receive index requests because they join with node.data and node.master set to false.
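If that is the cause, one possible fix, assuming the older Logstash 1.x elasticsearch output where the node/transport protocols make Logstash join the cluster (the host is a placeholder), is to switch to the HTTP protocol so Logstash stays out of the node list:
output {
  elasticsearch {
    # "node"/"transport" join the cluster as a client node;
    # "http" talks to the REST API instead
    protocol => "http"
    host => "your-es-host"
  }
}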

Logstash cluster output to Elasticsearch cluster without multicast

I want to run logstash -> elasticsearch with high availability and cannot find an easy way to achieve it. Please review how I see it and correct me:
Goal:
5 machines, each running Elasticsearch, forming a single cluster.
5 machines, each running a Logstash server and streaming data into the Elasticsearch cluster.
N machines under monitoring, each running lumberjack and streaming data to the Logstash servers.
Constraint:
It is supposed to run on a PaaS (CoreOS/Docker), so multicast discovery does not work.
Solution:
Lumberjack allows you to specify a list of Logstash servers to forward data to. Lumberjack will randomly select a target server and switch to another one if that server goes down (see the sketch after this list). It works.
I can use the ZooKeeper discovery plugin to construct the Elasticsearch cluster. It works.
With multicast, each Logstash server discovers and joins the Elasticsearch cluster. Without multicast it only lets me specify a single Elasticsearch host, but that is not highly available. I want to output to the cluster, not to a single host that can go down.
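For reference, a minimal logstash-forwarder (lumberjack) configuration with several targets could look like this (hosts, paths, and the certificate path are placeholders):
{
  "network": {
    "servers": [ "logstash-1:5043", "logstash-2:5043" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
  },
  "files": [
    { "paths": [ "/var/log/*.log" ] }
  ]
}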
Question:
Is it realistic to add a zookeeper discovery plugin to logstash's embedded elasticsearch? How?
Is there an easier (natural) solution for this problem?
Thanks!
You could potentially run a separate (non-embedded) Elasticsearch instance within the Logstash container, but configure it not to store data, and perhaps make these the master nodes:
node.data: false
node.master: true
You could then add your Zookeeper plugin to all Elasticsearch instances so they form the cluster.
Logstash then logs over HTTP to the local Elasticsearch instance, which works out where among the 5 data-storing nodes to actually index the data (see the sketch below).
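A sketch of that output, assuming the Logstash 1.x era elasticsearch output (the host here is the local non-data node):
output {
  elasticsearch {
    # send over http to the local client node, which routes the
    # documents to the five data-storing nodes
    protocol => "http"
    host => "localhost"
  }
}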
Alternatively, this question explains how to get plugins working with the embedded version of Elasticsearch: Logstash output to Elasticsearch on AWS EC2.
