I am new to Elasticsearch and have spent a long time trying to solve the question below. Perhaps the solution should be in the documentation, but it is not :-(
I have servers running in multiple time zones.
The log files get rsynced to servers in different time zones, but it is easy to know the origin time zone, either from a timezone field, e.g. {"timezone": "UTC"}, or from the time format itself, e.g. {"@timestamp": "2015-02-20T12:11:56.789Z"}.
I have full control over the log files and can adapt them if necessary.
When using Logstash, it changes the time format to the local time of the server it is running on, e.g. "@timestamp" => "2015-02-21T22:26:24.920-08:00"
How can I get the timezone consistently taken from the source log file, through Logstash and into Elasticsearch? (Obviously, I will want to have it in Kibana after that.) I have tried many things with no success.
Thanks in advance.
My goal was to create an _id in Elasticsearch that contains the logging time, so that it will never be repeated even if the log is sent again through Logstash.
After throwing a few more hours at the problem, I have some conclusions that, as far as I am concerned, are not well enough documented, plus a recommended workaround.
1) If the time format in the log file already includes a time zone, there is nothing you can do to override it in Logstash. Therefore, don't waste time on the timezone option, partial matching, or adding a timezone. If the time has a Z at the end, it will be treated as GMT. I think it is a bug that no warning is issued when this happens.
2) Logstash writes to standard output / file with the time in its own local time zone, regardless of the format of the input string.
3) Logstash works with the time in its local time zone, so concatenating the time into a variable gets messed up even if the original string was GMT. So just don't even try to work with the @timestamp variable!
4) Elasticsearch works in GMT, so it behaves properly. What you see in the output of Logstash as "@timestamp" => "2015-02-21T20:26:24.921-08:00" gets properly interpreted by Elasticsearch as "@timestamp" => "2015-02-22T04:26:24.921Z"
So my workaround is as follows:
1) Keep the logs with a timestamp field that is NOT @timestamp (e.g. log_time).
2) Consistently save times in the log files as GMT and mark them with a trailing Z.
3) Use the date filter in its most basic form, with no timezone attribute:
filter {
  date {
    match => ["log_time", "YYYY-MM-dd'T'HH:mm:ss.SSSZ"]
    # timezone => "Etc/GMT-8"   <--- THIS DOES NOT WORK IF THERE IS A Z IN THE SOURCE
  }
}
4) Create time derivatives straight from the log variable, not from @timestamp, e.g. (a concrete example of the resulting _id follows the config below):
output {
  stdout { codec => rubydebug }
  elasticsearch {
    host => "localhost"
    document_id => "%{log_time}-%{host}"       # <--- DO THIS
    # document_id => "%{@timestamp}-%{host}"   # <--- DON'T DO THIS
  }
}
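To make that concrete, here is a hedged example of what the document_id template resolves to (the field values are made up for illustration):

# event fields after filtering (hypothetical values)
log_time => "2015-02-21T20:26:24.921Z"
host     => "server01"

# document_id => "%{log_time}-%{host}" therefore resolves to
_id => "2015-02-21T20:26:24.921Z-server01"

Because re-sending the same log line resolves to the same _id, Elasticsearch overwrites the existing document instead of creating a duplicate.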
If Jordan Sissel happens to read this: I believe that Logstash should be consistent with Elasticsearch by default, or at least have an option to output and work internally in GMT. I had a rocky start doing what everyone goes through when trying out the tool for the first time with existing logs.
So I have a log file that is taken as input by Filebeat, which then outputs to Logstash and from there to Elasticsearch. I need to calculate the time elapsed from the moment I start Filebeat and its reading process until the data reaches Elasticsearch.
The pipeline is logfile --> Filebeat --> Logstash --> Elasticsearch. I need to find the time elapsed from Filebeat to Elasticsearch. I'm new to ELK and don't know how to use grok or similar features yet.
I have configured filebeat.yml as follows:
filebeat.inputs:
- type: log
  paths:
    - /home/user/us/logs/lgfile.json
output.logstash:
  hosts: ["localhost:port"]
and logstash conf file as:
input {
  beats {
    port => "5044"
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => ["localhost:port"]
    index => "elkpipego"
  }
}
So far I have done this time calculation manually, using a stopwatch from the time Logstash reads the first message and then taking the difference against the timestamp of the last message. But this is clearly not very accurate, and I would like to know if there is a way, using any feature/grok/tool like JMeter/esrally or anything else, to accurately find the exact time elapsed until the last message is stashed in Elasticsearch? TIA
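One possible approach (a sketch, not a definitive answer): Filebeat sets each event's @timestamp to the moment it read the line, and since the config above has no date filter overwriting it, a ruby filter in Logstash can record how far behind the pipeline is when the event passes through. This assumes the Logstash 5+ event API; the field name pipeline_lag_ms is made up for illustration:

filter {
  ruby {
    # @timestamp is the read time set by Filebeat, so the difference to "now"
    # is the lag accumulated between Filebeat and this point in the pipeline.
    code => "event.set('pipeline_lag_ms', ((Time.now.to_f - event.get('@timestamp').to_f) * 1000).round)"
  }
}

This only measures the lag up to the Logstash filter stage; to include the final hop into Elasticsearch you could additionally stamp an index-time field with an Elasticsearch ingest pipeline (a set processor using _ingest.timestamp) and compare the two fields in Kibana.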
How can I configure Filebeat to ship only a percentage of logs (a sample, if you will) to Logstash?
In my application's log folder the logs are chunked to about 20 MB each. I want Filebeat to ship only about 1/300th of that log volume to Logstash.
I need to pare down the log volume before I send it over the wire to Logstash, so I cannot do this filtering in Logstash; it needs to happen on the endpoint before it leaves the server.
I asked this question in the ES forum and someone said it was not possible with filebeat: https://discuss.elastic.co/t/ship-only-a-percentage-of-logs-to-logstash/77393/2
Is there really no way I can extend Filebeat to do this? Can nxlog or another product do this?
To the best of my knowledge, there is no way to do that with Filebeat. You can do it with Logstash, though.
filter {
  drop {
    percentage => 99.7
  }
}
This may be a use case where you would use Logstash in shipping mode on the server, rather than Filebeat.
input {
  file {
    path => "/var/log/hugelogs/*.log"
    tags => [ 'sampled' ]
  }
}
filter {
  drop {
    percentage => 99.7
  }
}
output {
  tcp {
    host => 'logstash.prod.internal'
    port => 3390
  }
}
It means installing Logstash on your servers. However, you configure it as minimally as possible: just an input, enough filters to get your desired effect, and a single output (TCP in this case, but it could be anything). Full filtering will happen further down the pipeline.
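For completeness, a minimal sketch of what the receiving end of that TCP hop might look like; the codec pairing is an assumption and should match however the shipping instance encodes events:

input {
  tcp {
    port => 3390
    codec => json_lines   # assumption: the shipper sends one JSON event per line
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}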
There's no way to configure Filebeat to drop arbitrary events based on a probability. But Filebeat does have the ability to drop events based on conditions. There are two ways to filter events.
Filebeat has a way to specify lines to include or exclude when reading the file. This is the most efficient place to apply the filtering because it happens early. This is done using include_lines and exclude_lines in the config file.
filebeat.prospectors:
- paths:
    - /var/log/myapp/*.log
  exclude_lines: ['^DEBUG']
All Beats have "processors" that allow you to apply an action based on a condition. One action is drop_event, and the supported conditions include regexp, contains, equals, and range.
processors:
- drop_event:
    when:
      regexp:
        message: '^DEBUG'
I'm trying to replicate the exact use case for the elasticsearch filter detailed in the docs:
https://www.elastic.co/guide/en/logstash/current/plugins-filters-elasticsearch.html
My output is also the same Elasticsearch server.
I need to compute the duration between two events, and the end events appear less than 10 ms after the start events.
What I'm observing is that Logstash fails to fetch the start event for some end events.
My guess is that such start events are still buffered when Logstash looks for them in ES.
I have tried setting the flush_size property to a low value in the elasticsearch output; this only helped a little. There were fewer "miss" cases when it was configured to a low value. I tried setting it to 1 too, just to confirm this. There were still a few exit events that couldn't find their entry events.
Is there anything else I should look for that could possibly be causing the issue? Setting flush_size to too low a value didn't help and doesn't look like an optimal solution either.
Here's my Logstash config:
filter {
  elasticsearch {
    hosts => ["ES_SERVER_IP:9200"]
    index => "logstash-filebeat-*"
    query => "event:ENTRY AND id:%{[id]}"
    fields => { "log-timestamp" => "started" }
    sort => ["@timestamp:desc"]
  }
  ruby {
    code => "event['processing_time'] = event['log-timestamp'] - event['started']"
  }
}
output {
  elasticsearch {
    hosts => ["ES_SERVER_IP:9200"]
  }
}
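One technique that sidesteps the buffering/refresh race entirely is to correlate the two events inside Logstash with the aggregate filter instead of querying Elasticsearch. This is only a hedged sketch: it assumes the Logstash 5+ event API, that the event field holds the literal values ENTRY and EXIT, that log-timestamp is numeric (as the ruby filter above implies), and it requires running Logstash with a single pipeline worker (-w 1):

filter {
  if [event] == "ENTRY" {
    aggregate {
      task_id => "%{id}"
      code => "map['started'] = event.get('log-timestamp')"
      map_action => "create"
    }
  }
  if [event] == "EXIT" {
    aggregate {
      task_id => "%{id}"
      code => "event.set('processing_time', event.get('log-timestamp') - map['started'])"
      map_action => "update"
      end_of_task => true
      timeout => 120   # seconds to wait for an EXIT before discarding the map entry
    }
  }
}

Since the correlation happens in memory, the start event no longer has to be searchable in Elasticsearch by the time the end event arrives.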
I have a prototype working for me, with devices sending logs and Logstash parsing them and putting them into Elasticsearch.
Logstash output code:
output {
  if [type] == "json" {
    elasticsearch {
      hosts => ["host1:9200","host2:9200","host3:9200"]
      index => "index-metrics-%{+xxxx.ww}"
    }
  }
}
Now my question is:
I will be putting this solution into production. For simplicity, assume that I have one cluster with 5 nodes in it right now.
So I know I can give an array of the 5 nodes' IPs / hostnames in the elasticsearch output plugin and it will round-robin to distribute the data.
How can I avoid putting all my node IPs / hostnames into the Logstash config file?
As the system goes into production, I don't want to manually go into each Logstash instance and update these hosts.
What are the best practices one should follow in this case?
My requirement is:
I want to run my ES cluster and be able to add / remove / update any number of nodes at any time. I need all of my Logstash instances to keep sending data irrespective of changes on the ES side.
Thanks.
If you want to add/remove/update hosts, you will need to run sed or some other kind of string replacement before the service starts. Logstash configs are "compiled" at startup and cannot be changed on the fly.
# placeholder in the Logstash config template
hosts => [$HOSTS]
...
# before starting the service, substitute the real host list
$ HOSTS="\"host1:9200\",\"host2:9200\""
$ sed "s/\$HOSTS/$HOSTS/g" $config
Your other option is to use environment variables for the dynamic portion, but that won't allow you to use a dynamic number of hosts.
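For what it's worth, a minimal sketch of the environment-variable approach; this assumes a Logstash version where ${VAR} substitution in config files is available (5.x and later by default), and ES_HOST is just a name made up for this example:

output {
  elasticsearch {
    # ${ES_HOST} is read from the environment when the pipeline is loaded
    hosts => ["${ES_HOST}"]
  }
}

You would then start Logstash with something like ES_HOST="host1:9200" set in its environment. As noted above, this still pins you to a fixed number of hosts.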
I have connected Logstash, Elasticsearch and Kibana, and it all works fine.
I used Logstash to ingest the Tomcat logs:
input {
  file {
    path => "/tom_logs/*"
    type => "tomcat"
    start_position => "end"
  }
}
Once I update the log file, it takes all the logs in the file instead of only the newly added ones. I just want to load the log entries that were last added.
Can anyone help me?
Thanks in advance
Your problem is a bit strange, because I have never experienced it. To be sure that I understand correctly: when a new log line comes in, Logstash starts analysing all the logs in the file again?
You correctly specify start_position => "end", which is actually the default option. In this case, Logstash should only consider new changes in the file (so, new logs) since its start-up.
So, I think the cause of this "bug" is not in Logstash but in how Tomcat writes its logs... But if I were you, I'd try to specify path => "/tom_logs/*.log" instead of * only.
Hope it helps.
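In other words, the suggestion above amounts to nothing more than narrowing the glob in the original input block (a sketch; everything else stays as it was):

input {
  file {
    path => "/tom_logs/*.log"   # match only .log files, as suggested above
    type => "tomcat"
    start_position => "end"
  }
}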