Logstash keeps doing s3 input tasks but never sends output events - elasticsearch

I have an issue with my logstash s3 input. The last messages I see in my Kibana interface are from several days ago. I have an AWS ELB with logging enabled. I've tested from the command line and I can see that logstash is continuously processing inputs, but it never outputs anything. In the ELB s3 bucket there is one folder per year/month/day, and each folder contains several log files, for a total size of around 60GB.
It was working fine at the beginning, but as the logs grew it became slow, and now I no longer see my logs on the output side. Logstash keeps doing the input and filter tasks, but never outputs events.
I created a dedicated configuration file for testing, with only s3 as input, and tested it on a dedicated machine from the command line:
```
/opt/logstash/bin/logstash agent -f /tmp/s3.conf --debug 2>&1 | tee /tmp/logstash.log
```
The s3.conf file:
```
admin#ip-10-3-27-129:~$ cat /tmp/s3.conf
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# !!!!!!!!! This file is managed by SALT !!!!!!!!!
# !!!!!!!!! All changes will be lost !!!!!!!!!
# !!!!!!!!! DO NOT EDIT MANUALLY ! !!!!!!!!!
# !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#--[ INPUT ]----------------------------------------------------------------
input
{
  # Logs ELB API
  s3
  {
    bucket => "s3.prod.elb.logs.eu-west-1.mydomain"
    prefix => "rtb/smaato/AWSLogs/653589716289/elasticloadbalancing/"
    interval => 30
    region => "eu-west-1"
    type => "elb_access_log"
  }
}
#--[ FILTER ]---------------------------------------------------------------
filter
{
  # Set the HTTP request time to @timestamp field
  date {
    match => [ "timestamp", "ISO8601" ]
    remove_field => [ "timestamp" ]
  }
  # Parse the ELB access logs
  if [type] == "elb_access_log" {
    grok {
      match => [ "message", "%{TIMESTAMP_ISO8601:timestamp:date} %{HOSTNAME:loadbalancer} %{IP:client_ip}:%{POSINT:client_port:int} (?:%{IP:backend_ip}:%{POSINT:backend_port:int}|-) %{NUMBER:request_processing_time:float} %{NUMBER:backend_processing_time:float} %{NUMBER:response_processing_time:float} %{INT:backend_status_code:int} %{INT:received_bytes:int} %{INT:sent_bytes:int} %{INT:sent_bytes_ack:int} \"%{WORD:http_method} %{URI:url_asked} HTTP/%{NUMBER:http_version}\" \"%{GREEDYDATA:user_agent}\" %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol}" ]
      remove_field => [ "message" ]
    }
    kv {
      field_split => "&?"
      source => "url_asked"
    }
    date {
      match => [ "timestamp", "ISO8601" ]
      remove_field => [ "timestamp" ]
    }
  }
  # Remove the filebeat input tag
  mutate {
    remove_tag => [ "beats_input_codec_plain_applied" ]
  }
  # Remove field tags if empty
  if [tags] == [] {
    mutate {
      remove_field => [ "tags" ]
    }
  }
  # Remove some unnecessary fields to make Kibana cleaner
  mutate {
    remove_field => [ "@version", "count", "fields", "input_type", "offset", "[beat][hostname]", "[beat][name]", "[beat]" ]
  }
}
#--[ OUTPUT ]---------------------------------------------------------------
output
#{
#  elasticsearch {
#    hosts => ["10.3.16.75:9200"]
#  }
#}
{
  # file {
  #   path => "/tmp/logastash/elb/elb_logs.json"
  # }
  stdout { codec => rubydebug }
}
```
And I can see the input processing, the filtering, and messages like "will start output worker.....", but never any output events received.
I created a new folder (named test_elb) in the bucket, copied the logs from one day's folder (31/12/2016 for example) into it, and then set the newly created folder as the prefix in my input configuration, like this:
```
s3
{
  bucket => "s3.prod.elb.logs.eu-west-1.mydomain"
  prefix => "rtb/smaato/AWSLogs/653589716289/test_elb/"
  interval => 30
  region => "eu-west-1"
  type => "elb_access_log"
}
```
With that s3 prefix, logstash does all the pipeline processing (input, filter, output) as expected, and I see my logs in the output.
So it seems to me that the bucket is too large and the logstash s3 plugin has difficulty processing it.
Can someone advise on this problem, please?
My logstash version : 2.2.4
Operating system: Debian Jessie
I've searched and asked on the discuss.elastic forum and in the elasticsearch IRC channel, with no real solution.
Do you think it could be a bucket size issue?
Thanks for the help.
Regards.

Configure the s3 input plugin to move files, once processed, to a bucket/path that the input does not consider.
While there are many files in the input bucket/path, you may need to run logstash on a subset of the data until it has moved the files to the backup bucket/path.
This is what I'm doing to process about 0.5 GiB (several hundred thousand files) per day. Logstash pulls all of the object names before doing any inserts, so the process will appear to be stuck if you have a huge number of files in your bucket.
bucket => "BUCKET_NAME"
prefix => "logs/2017/09/01"
backup_add_prefix => "sent-to-logstash-"
backup_to_bucket => "BUCKET_NAME"
interval => 120
delete => true
I'm not certain how durable the process is against data loss between the bucket moves, but for logs which aren't mission critical, this process is highly efficient considering the number of files being moved.
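For context, a complete input block using those options might look like the sketch below. The bucket name and prefix are placeholders; the backup_to_bucket/backup_add_prefix/delete combination is what moves each processed object under a key that the scanned prefix never matches:
```
input {
  s3 {
    bucket            => "BUCKET_NAME"          # bucket being scanned (placeholder)
    prefix            => "logs/2017/09/01"      # limit the listing to one day at a time
    region            => "eu-west-1"            # assumed region, adjust to yours
    backup_to_bucket  => "BUCKET_NAME"          # copy processed objects back into the same bucket...
    backup_add_prefix => "sent-to-logstash-"    # ...under a prefix the input never lists
    delete            => true                   # then delete the original so the listing keeps shrinking
    interval          => 120                    # seconds between bucket listings
    type              => "elb_access_log"
  }
}
```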

This behaviour is controlled by the watch_for_new_files parameter. With the default setting, true, logstash will not process the existing files and will instead wait for new files to arrive.
Example:
```
input {
  s3 {
    bucket => "the-bucket-name"
    prefix => "the_path/ends_with_the_slash/"
    interval => 30
    region => "eu-west-1"
    type => "elb_access_log"
    watch_for_new_files => false
  }
}
output {
  stdout {}
}
```

Related

Logstash pipeline data is not pushed to Elasticsearch

```
input {
  file {
    path => "/home/blusapphire/padma/sampledata.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => [ "First_Name", "Last_Name", "Age", "Salary", "Emailid", "Gender" ]
  }
}
output {
  elasticsearch {
    hosts => ["${ES_INGEST_HOST_02}:9200"]
    index => "network"
    user => "adcd"
    password => "adcbdems"
  }
}
```
This is my logstash config file. When running logstash I'm not seeing the data (from the given CSV) in Elasticsearch, and the index is not being created. Is there any mistake in the configuration?
Based on your screenshot, it is clear that logstash has noticed your file. To make logstash treat it as "first contact" with the file, try:
shutdown logstash
delete the sincedb file
start up logstash
Alternatively, move the file outside the monitored folder, execute the above steps and then drop the file in.
Can you also confirm that your set up works with other files - just in case?
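As a side note, when testing the same CSV repeatedly it can help to point the file input at a throwaway sincedb so that no read position survives between runs. A minimal sketch, assuming re-reading the whole file on every run is acceptable (testing only):
```
input {
  file {
    path => "/home/blusapphire/padma/sampledata.csv"
    start_position => "beginning"
    # /dev/null means no sincedb state is persisted, so the file
    # is treated as new on every logstash start (testing only).
    sincedb_path => "/dev/null"
  }
}
```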

Sending Cloudtrail gzip logs from S3 to ElasticSearch

I am relatively new to the whole ELK setup, so please bear with me.
What I want to do is send the CloudTrail logs that are stored on S3 into a locally hosted (non-AWS, I mean) ELK setup. I am not using Filebeat anywhere in the setup. I believe it isn't mandatory to use it; Logstash can deliver data directly to ES.
Am I right here?
Once the data is in ES, I would simply want to visualize it in Kibana.
What I have tried so far, given that my ELK is up and running and that there is no Filebeat involved in the setup:
using the S3 logstash plugin
contents of /etc/logstash/conf.d/aws_ct_s3.conf
```
input {
  s3 {
    access_key_id => "access_key_id"
    bucket => "bucket_name_here"
    secret_access_key => "secret_access_key"
    prefix => "AWSLogs/<account_number>/CloudTrail/ap-southeast-1/2019/01/09"
    sincedb_path => "/tmp/s3ctlogs.sincedb"
    region => "us-east-2"
    codec => "json"
    add_field => { source => gzfiles }
  }
}
output {
  stdout { codec => json }
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "attack-%{+YYYY.MM.dd}"
  }
}
```
When logstash is started with the above conf, I can see everything working fine. Using the head Google Chrome plugin, I can see that documents are continuously getting added to the specified index. In fact, when I browse it, I can see the data I need. I am able to see the same on the Kibana side too.
The data in each of these gzip files is of the format:
```
{
  "Records": [
    dictionary_D1,
    dictionary_D2,
    ...
  ]
}
```
And I want each of these dictionaries from the list above to be a separate event in Kibana. With some Googling around, I understand that I could use the split filter to achieve this. Now my aws_ct_s3.conf looks something like this:
```
input {
  s3 {
    access_key_id => "access_key_id"
    bucket => "bucket_name_here"
    secret_access_key => "secret_access_key"
    prefix => "AWSLogs/<account_number>/CloudTrail/ap-southeast-1/2019/01/09"
    sincedb_path => "/tmp/s3ctlogs.sincedb"
    region => "us-east-2"
    codec => "json"
    add_field => { source => gzfiles }
  }
}
filter {
  split {
    field => "Records"
  }
}
output {
  stdout { codec => json }
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "attack-%{+YYYY.MM.dd}"
  }
}
```
And with this I am indeed getting the data I need in Kibana.
Now the problem is:
Without the filter in place, the data being shipped by Logstash from S3 to Elasticsearch was in GBs, while after applying the filter it has stopped at roughly 5000 documents.
I do not know what I am doing wrong here. Could someone please help?
Current config:
java -XshowSettings:vm => Max Heap Size: 8.9 GB
elasticsearch jvm options => max and min heap size: 6GB
logstash jvm options => max and min heap size: 2GB
ES version - 6.6.0
LS version - 6.6.0
Kibana version - 6.6.0
This is what the current heap usage looks like:

Duplicate field values for grok-parsed data

I have a Filebeat that captures logs from a uWSGI application running in Docker. The data is sent to logstash, which parses it and forwards it to Elasticsearch.
Here is the logstash conf file:
```
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { "log" => "\[pid: %{NUMBER:worker.pid}\] %{IP:request.ip} \{%{NUMBER:request.vars} vars in %{NUMBER:request.size} bytes} \[%{HTTPDATE:timestamp}] %{URIPROTO:request.method} %{URIPATH:request.endpoint}%{URIPARAM:request.params}? => generated %{NUMBER:response.size} bytes in %{NUMBER:response.time} msecs(?: via sendfile\(\))? \(HTTP/%{NUMBER:request.http_version} %{NUMBER:response.code}\) %{NUMBER:headers} headers in %{NUMBER:response.size} bytes \(%{NUMBER:worker.switches} switches on core %{NUMBER:worker.core}\)" }
  }
  date {
    # 29/Oct/2018:06:50:38 +0700
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z"]
  }
  kv {
    source => "request.params"
    field_split => "&?"
    target => "request.query"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index"
  }
}
```
Everything was fine, but I've noticed that all values captured by the grok pattern are duplicated. Here is how it looks in Kibana:
Note that the raw data, like the log field, which wasn't grok output, is fine. I've seen that the kv filter has an allow_duplicate_values parameter, but it doesn't apply to grok.
What is wrong with my configuration? Also, is it possible to rerun grok patterns on existing data in elasticsearch?
Maybe your Filebeat is already doing the job and creating these fields.
Did you try adding this parameter to your grok?
overwrite => [ "request.ip", "request.endpoint", ... ]
In order to rerun grok on already indexed data, you need to use the elasticsearch input plugin to read the data from ES and re-index it after grok.
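A minimal sketch of such a re-indexing pipeline, assuming the documents live in test-index and the re-parsed events go to a new index (the index names, the query, and the shortened grok pattern are illustrative placeholders):
```
input {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index"                         # read back the already indexed documents
    query => '{ "query": { "match_all": {} } }'
  }
}
filter {
  grok {
    # Re-apply the original uwsgi pattern here; overwrite makes grok
    # replace fields that already exist instead of appending duplicates.
    match     => { "log" => "PUT_THE_FULL_UWSGI_PATTERN_HERE" }
    overwrite => [ "request.ip", "request.endpoint" ]
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index-reparsed"                # write into a fresh index
  }
}
```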

logstash indexing files multiple times?

I'm using logstash (v2.3.3-1) to index about 800k documents from S3 into AWS Elasticsearch, and some documents are being indexed 2 or 3 times instead of only once.
The files are static (nothing is updating them or touching them) and they're very small (each is roughly 1.1KB).
The process takes a very long time to run on a t2.micro (~1day).
The config I'm using is:
```
input {
  s3 {
    bucket => "$BUCKETNAME"
    codec => "json"
    region => "$REGION"
    access_key_id => '$KEY'
    secret_access_key => '$KEY'
    type => 's3'
  }
}
filter {
  if [type] == "s3" {
    metrics {
      meter => "events"
      add_tag => "metric"
    }
  }
}
output {
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  } else {
    amazon_es {
      hosts => [$HOST]
      region => "$REGION"
      index => "$INDEXNAME"
      aws_access_key_id => '$KEY'
      aws_secret_access_key => '$KEY'
      document_type => "$TYPE"
    }
    stdout { codec => rubydebug }
  }
}
```
I've run this twice now with the same problem (into different ES indices) and the files that are being indexed >1x are different each time.
Any suggestions gratefully received!
The s3 input is very fragile. It records the time of the last file processed, so any files that share the same timestamp will not be processed, and multiple logstash instances cannot read from the same bucket. As you've seen, it's also painfully slow to determine which files to process (a good portion of the blame goes to Amazon here).
s3 only works for me when I use a single logstash instance to read the files, delete (or back up to another bucket/folder) the files to keep the original bucket as empty as possible, and set sincedb_path to /dev/null.
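A sketch of that arrangement, under the assumption that a single logstash instance owns the source bucket (the bucket names are placeholders; backup_to_bucket, delete and sincedb_path are existing s3 input options):
```
input {
  s3 {
    bucket           => "$BUCKETNAME"             # source bucket, read by one logstash only
    region           => "$REGION"
    codec            => "json"
    backup_to_bucket => "$BUCKETNAME-processed"   # hypothetical archive bucket for processed files
    delete           => true                      # remove originals so the source bucket stays near empty
    # No persistent sincedb: keeping the source bucket empty, rather than
    # the sincedb, is what prevents files from being processed twice.
    sincedb_path     => "/dev/null"
    type             => "s3"
  }
}
```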

How to extract variables from log file path, test log file name for pattern in Logstash?

I have AWS ElasticBeanstalk instance logs on S3 bucket.
Path to Logs is:
resources/environments/logs/publish/e-3ykfgdfgmp8/i-cf216955/_var_log_nginx_rotated_access.log1417633261.gz
which translates to:
resources/environments/logs/publish/e-[random environment id]/i-[random instance id]/
The path contains multiple logs:
_var_log_eb-docker_containers_eb-current-app_rotated_application.log1417586461.gz
_var_log_eb-docker_containers_eb-current-app_rotated_application.log1417597261.gz
_var_log_rotated_docker1417579261.gz
_var_log_rotated_docker1417582862.gz
_var_log_rotated_docker-events.log1417579261.gz
_var_log_nginx_rotated_access.log1417633261.gz
Notice that there's a random number (a timestamp?) inserted by AWS in the filename before ".gz".
The problem is that I need to set variables depending on the log file name.
Here's my configuration:
```
input {
  s3 {
    debug => "true"
    bucket => "elasticbeanstalk-us-east-1-something"
    region => "us-east-1"
    region_endpoint => "us-east-1"
    credentials => ["..."]
    prefix => "resources/environments/logs/publish/"
    sincedb_path => "/tmp/s3.sincedb"
    backup_to_dir => "/tmp/logstashed/"
    tags => ["s3","elastic_beanstalk"]
    type => "elastic_beanstalk"
  }
}
filter {
  if [type] == "elastic_beanstalk" {
    grok {
      match => [ "#source_path", "resources/environments/logs/publish/%{environment}/%{instance}/%{file}<unnecessary_number>.gz" ]
    }
  }
}
```
In this case I want to extract the environment, instance and file name from the path. In the file name I need to ignore that random number.
Am I doing this the right way? What would be the full, correct solution for this?
Another question: how can I specify fields for a custom log format for a particular log file from the list above?
It could be something like this (pseudo-code):
```
filter {
  if [type] == "elastic_beanstalk" {
    if [file_name] BEGINS WITH "application_custom_log" {
      grok {
        match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" ]
      }
    }
    if [file_name] BEGINS WITH "some_other_custom_log" {
      ....
    }
  }
}
```
How do I test for a file name pattern?
For your first question, and assuming that #source_path contains the full path, try:
match => [ "#source_path", "logs/publish/%{NOTSPACE:env}/%{NOTSPACE:instance}/%{NOTSPACE:file}%{NUMBER}%{NOTSPACE:suffix}" ]
This will create 4 logstash fields for you:
env
instance
file
suffix
More information is available on the grok man page and you should test with the grok debugger.
To test fields in logstash, you use conditionals, e.g.
if [field] == "value"
if [field] =~ /regexp/
etc.
Note that it's not always necessary to do this with grok. You can have multiple 'match' arguments, and it will (by default) stop after hitting the first one that matches. If your patterns are exclusive, this should work for you.
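As an illustration, one way to express this is a single grok filter with an array of candidate patterns for one source field (the patterns below are placeholders for your per-log-type formats; break_on_match defaults to true, so grok stops at the first pattern that matches):
```
filter {
  grok {
    # Patterns are tried in order; with break_on_match => true (the default)
    # grok stops as soon as one of them matches.
    match => {
      "message" => [
        "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}",
        "%{COMBINEDAPACHELOG}",
        "%{GREEDYDATA:raw_line}"
      ]
    }
  }
}
```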
