Logstash - Elasticsearch filter unable to fetch start events

I'm trying to replicate the exact use case for the elasticsearch filter detailed in the docs:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-elasticsearch.html
My output is also the same elasticsearch server.
I need to compute the time duration between two events, and the end events can appear less than 10ms after the start events.
What I'm observing is that Logstash fails to fetch the start event for some end events.
My guess is that such start events are still buffered when Logstash looks for them in ES.
I have tried setting the flush_size property to a low value in the elasticsearch output; this only helped a little. There were fewer "miss" cases when it was configured to a low value. I also tried setting it to 1, just to confirm this, and there were still a few exit events that couldn't find their entry events.
Is there anything else I should look at that could be causing the issue? Setting flush_size to a very low value didn't fully help and doesn't look like an optimal solution either.
Here's my logstash config :
filter {
  elasticsearch {
    hosts => ["ES_SERVER_IP:9200"]
    index => "logstash-filebeat-*"
    query => "event:ENTRY AND id:%{[id]}"
    fields => { "log-timestamp" => "started" }
    sort => ["@timestamp:desc"]
  }
  ruby {
    code => "event['processing_time'] = event['log-timestamp'] - event['started']"
  }
}
output {
  elasticsearch {
    hosts => ["ES_SERVER_IP:9200"]
  }
}
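For reference, this is roughly what the flush_size experiment mentioned above looked like; a minimal sketch, assuming a Logstash version whose elasticsearch output still exposes flush_size and idle_flush_time:
output {
  elasticsearch {
    hosts => ["ES_SERVER_IP:9200"]
    flush_size => 1        # flush every event immediately instead of batching
    idle_flush_time => 1   # also flush at least once per second
  }
}
Even with flush_size set to 1, documents generally only become searchable in Elasticsearch after the next index refresh (1s by default), which may explain why some start events are still missed.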

Related

Many Logstash instances reading from Redis

I have one Logstash process running on one node consuming from a Redis list, but I'm afraid that a single process cannot handle the data throughput without a significant delay.
I was wondering whether running one more Logstash process on this same machine would perform a little better, but I'm not certain about that. I know that my ES index is not a bottleneck.
Would Logstash duplicate my data if I consume the same list? Does this approach seem like the right thing to do?
Thanks!
Here my input configuration:
input {
  redis {
    data_type => "list"
    batch_count => 300
    key => "flight_pricing_stats"
    host => "my-redis-host"
  }
}
Before running another Logstash process on the same machine, you could try adjusting the redis input's threads setting. The default is 1.
input {
  redis {
    data_type => "list"
    batch_count => 300
    key => "flight_pricing_stats"
    host => "my-redis-host"
    threads => 2
  }
}
You could run more than one Logstash instance against the same Redis list; events should not get duplicated, since each pop removes the entry from the list. But I'm not sure that would help.
If you're not certain what's going on, I recommend the Logstash monitoring API. It can help you narrow down your real bottleneck.
There is also an interesting post from Elastic on the subject: "Logstash Lines: Introducing a benchmarking tool for Logstash".

extracting data from multiple events from Elasticsearch using single logstash filter

I have log lines loaded in Elasticsearch where the data is scattered across multiple events: say event_id is in event (line) number 5, event_action is available in event number 88, and event_port information is available in event number 455. How can I extract this data so that my output looks like the following? The multiline codec will not work for this case.
{
  event_id: 1223
  event_action: "socket_open"
  event_port: 76654
}
Currently I have the log files persisted, so I can get the file path from ES. I tried to have a shell script executed from a ruby filter; this shell script performs grep commands and puts the stdout data into the event, as follows.
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "my-logs"
  }
}
filter {
  ruby {
    code => '
      require "open3"
      file_path = event.get("file_path")
      cmd = "my_filter.sh -f #{file_path}"
      stdin, stdout, stderr = Open3.popen3(cmd)
      event.set("process_result", stdout.read)
      err = stderr.read
      if err.to_s.empty?
        filter_matched(event)
      else
        event.set("ext_script_err_msg", err)
      end
    '
    remove_field => ["file_path"]
  }
}
With the above approach I am facing two problems:
1) Doing grep on huge files can be time consuming. Is there any alternative that avoids grepping the files?
2) My input plugin (attached above) takes events from Elasticsearch, where file_path is set on all events in the index, so my_filter.sh is executed multiple times, which is something I want to avoid. How can I extract the unique file_path values from ES?
Elasticsearch was not made to build an output stream that depends on other input. Elasticsearch is a NoSQL database where the data should be consumed over time (in a near-real-time fashion). That means you should first store everything in Elasticsearch and process the data afterwards. In your case, you are stalling the flow by waiting for different events.
If you really need to catch these events and process them in the background, you could try something like nxlog before filtering in Logstash (the input is nxlog), or a Python script (used as a filter in Logstash). In your case I would pre-process the data to consolidate it, and then send it to Logstash.

Ship only a percentage of logs to logstash

How can I configure filebeat to only ship a percentage of logs (a sample if you will) to logstash?
In my application's log folder the logs are chunked to about 20 megs each. I want filebeat to ship only about 1/300th of that log volume to logstash.
I need to pare down the log volume before I send it over the wire to Logstash, so I cannot do this filtering in Logstash; it needs to happen on the endpoint before it leaves the server.
I asked this question in the ES forum and someone said it was not possible with filebeat: https://discuss.elastic.co/t/ship-only-a-percentage-of-logs-to-logstash/77393/2
Is there really no way I can extend filebeat to do this? Can nxlog or another product do this?
To the best of my knowledge, there is no way to do that with FileBeat. You can do it with Logstash, though.
filter {
  drop {
    percentage => 99.7
  }
}
This may be a use-case where you would use Logstash in shipping mode on the server, rather than FileBeat.
input {
  file {
    path => "/var/log/hugelogs/*.log"
    tags => [ 'sampled' ]
  }
}
filter {
  drop {
    percentage => 99.7
  }
}
output {
  tcp {
    host => 'logstash.prod.internal'
    port => '3390'
  }
}
This means installing Logstash on your servers. However, you configure it as minimally as possible: just an input, enough filters to get your desired effect, and a single output (tcp in this case, but it could be anything). Full filtering will happen further down the pipeline.
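For completeness, the central Logstash instance further down the pipeline would need a matching tcp input to receive these events. A minimal sketch, assuming a json_lines codec is used on both ends (the codec is not specified in the shipper config above):
input {
  tcp {
    port => 3390
    codec => json_lines   # assumes the shipper's tcp output also sets codec => json_lines
  }
}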
There's no way to configure Filebeat to drop arbitrary events based on a probability. But Filebeat does have the ability to drop events based on conditions. There are two ways to filter events.
Filebeat has a way to specify lines to include or exclude when reading the file. This is the most efficient place to apply the filtering because it happens early. This is done using include_lines and exclude_lines in the config file.
filebeat.prospectors:
- paths:
    - /var/log/myapp/*.log
  exclude_lines: ['^DEBUG']
All Beats have "processors" that allow you to apply an action based on a condition. One action is drop_event, and the conditions include regexp, contains, equals, and range.
processors:
- drop_event:
    when:
      regexp:
        message: '^DEBUG'

How to add dynamic hosts in Elasticsearch and logstash

I have a prototype working where devices send logs, and Logstash parses them and puts them into Elasticsearch.
Logstash output config:
output {
  if [type] == "json" {
    elasticsearch {
      hosts => ["host1:9200","host2:9200","host3:9200"]
      index => "index-metrics-%{+xxxx.ww}"
    }
  }
}
Now my question is:
I will be putting this solution into production. For simplicity, assume that I have one cluster with 5 nodes in it right now.
I know I can give an array of the 5 nodes' IPs/hostnames in the elasticsearch output plugin, and it will round-robin to distribute the data.
How can I avoid putting all my node IPs/hostnames into the Logstash config file?
As the system goes into production, I don't want to manually go into each Logstash instance and update these hosts.
What are the best practices one should follow in this case?
My requirement is:
I want to run my ES cluster and be able to add/remove/update any number of nodes at any time. I need all of my Logstash instances to keep sending data irrespective of changes on the ES side.
Thanks.
If you want to add/remove/update hosts, you will need to run sed or some other kind of string replacement before the service starts. Logstash configs are "compiled" at startup and cannot be changed that way afterwards.
hosts => [$HOSTS]
...
$ HOSTS="\"host1:9200\",\"host2:9200\""
$ sed "s/\$HOSTS/$HOSTS/g" $config
Your other option is to use environment variables for the dynamic portion, but that won't allow you to use a dynamic number of hosts (see the sketch below).
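As an illustration of the environment-variable approach, here is a minimal sketch, assuming a Logstash version that supports ${VAR} substitution in the pipeline configuration (the variable names here are hypothetical):
output {
  elasticsearch {
    # ES_HOST1 and ES_HOST2 are resolved from the environment at pipeline startup;
    # the number of hosts stays fixed, only their addresses can change.
    hosts => ["${ES_HOST1}", "${ES_HOST2}"]
    index => "index-metrics-%{+xxxx.ww}"
  }
}
Because the host list is still fixed at startup, adding a sixth node would still require a config change and a restart.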

logstash with elasticsearch_http

Apparently my Logstash OnDemand account did not work when I wanted to post an issue.
Anyway, I have a Logstash setup with Redis, Elasticsearch, and Kibana. My Logstash shippers are collecting logs from several files and putting them into Redis just fine.
Logstash version 1.3.3
Elasticsearch version 1.0.1
The only thing I have configured in elasticsearch_http for Logstash is the host name. The whole setup seems to glue together just fine.
The problem is that elasticsearch_http is not consuming the Redis entries as they come. What I have seen by running it in debug mode is that it flushes about 100 entries every minute (the default values for flush_size and idle_flush_time). However, as I understand the documentation, it should force a flush even when the flush_size of 100 is not reached (for example, if we only had 10 messages in the last minute). But it seems to work the other way: it flushes only about 100 messages every minute. I changed the size to 2000 and it flushes 2000 every minute or so.
Here is my logstash-indexer.conf
input {
  redis {
    host => "1xx.xxx.xxx.93"
    data_type => "list"
    key => "testlogs"
    codec => json
  }
}
output {
  elasticsearch_http {
    host => "1xx.xxx.xxx.93"
  }
}
Here is my elasticsearch.yml
cluster.name: logger
node.name: "logstash"
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.unicast.hosts: ["1xx.xxx.xxx.93:9300"]
discovery.zen.ping.multicast.enabled: false
#discovery.zen.ping.unicast.enabled: true
network.bind_host: 1xx.xxx.xxx.93
network.publish_host: 1xx.xxx.xxx.93
The indexer, elasticsearch, redis, and kibana are on same server. The log collection from file is done on another server.
So I'm going to suggest a couple of different approaches to solve your problem. Logstash, as you are discovering, can be a bit quirky, so I've found these approaches useful in dealing with unexpected behavior from Logstash.
1) Use the elasticsearch output instead of elasticsearch_http. You can get the same functionality by using the elasticsearch output with protocol set to http. The elasticsearch output is more mature (milestone 2 vs milestone 3) and I've seen this change make a difference before.
2) Set idle_flush_time and flush_size explicitly. There have been issues with Logstash defaults previously, and I've found it a lot safer to set them explicitly. idle_flush_time is in seconds, flush_size is the number of records to flush. (A sketch combining this with point 1 follows after this list.)
3) Upgrade to a more recent version of Logstash. There is enough of a change in how Logstash is deployed with version 1.4.X (http://logstash.net/docs/1.4.1/release-notes) that I'd bite the bullet and upgrade. It's also significantly easier to get attention if you still have a problem with the most recent stable major release.
4) Make certain your Redis version matches those supported by your Logstash version.
5) Experiment with setting the batch, batch_events and batch_timeout values for the Redis output. You are using the list data_type. list supports various batch options and, as with some other parameters, it's best not to assume the defaults are always being set correctly.
6) Do all of the above. In addition to trying the first set of suggestions, I'd try all of them together in various combinations. Keep careful records of each test run. It seems obvious, but between all the variations above it's easy to lose track, so try to change only one variation at a time.
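As referenced in point 2 above, here is a rough sketch of what the reworked output could look like. This assumes Logstash 1.4.x, where the elasticsearch output supports protocol => "http" along with flush_size and idle_flush_time (option names on older releases may differ):
output {
  elasticsearch {
    host => "1xx.xxx.xxx.93"
    protocol => "http"     # same HTTP transport as elasticsearch_http
    flush_size => 100      # flush once this many events are buffered...
    idle_flush_time => 1   # ...or after this many seconds, whichever comes first
  }
}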
