Sending CloudTrail gzip logs from S3 to Elasticsearch

I am relatively new to the whole ELK setup, so please bear with me.
What I want to do is send the CloudTrail logs that are stored on S3 into a locally hosted (non-AWS, I mean) ELK setup. I am not using Filebeat anywhere in the setup; I believe it isn't mandatory, since Logstash can deliver data directly to Elasticsearch.
Am I right here?
Once the data is in ES, I would simply want to visualize it in Kibana.
What I have tried so far, given that my ELK is up and running and that there is no Filebeat involved in the setup:
Using the S3 Logstash input plugin.
Contents of /etc/logstash/conf.d/aws_ct_s3.conf:
input {
  s3 {
    access_key_id => "access_key_id"
    bucket => "bucket_name_here"
    secret_access_key => "secret_access_key"
    prefix => "AWSLogs/<account_number>/CloudTrail/ap-southeast-1/2019/01/09"
    sincedb_path => "/tmp/s3ctlogs.sincedb"
    region => "us-east-2"
    codec => "json"
    add_field => { "source" => "gzfiles" }
  }
}
output {
  stdout { codec => json }
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "attack-%{+YYYY.MM.dd}"
  }
}
When Logstash is started with the above conf, everything looks fine. Using the elasticsearch-head Chrome plugin, I can see that documents are continuously being added to the specified index. In fact, when I browse the index, I can see the data I need, and I can see the same on the Kibana side too.
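The same can be verified without the head plugin, e.g. with a quick count against the index pattern (assuming the default local endpoint from the config above):

  # Count documents matching the attack-* index pattern
  curl -s 'http://127.0.0.1:9200/attack-*/_count?pretty'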
The data that each of these gzip files holds is of the format:
{
  "Records": [
    dictionary_D1,
    dictionary_D2,
    .
    .
    .
  ]
}
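Each dictionary is one CloudTrail record; heavily abridged, and with made-up values, a single record looks roughly like:

  {
    "eventVersion": "1.05",
    "eventTime": "2019-01-09T10:15:30Z",
    "eventSource": "s3.amazonaws.com",
    "eventName": "GetObject",
    "awsRegion": "ap-southeast-1",
    "sourceIPAddress": "203.0.113.10",
    "userIdentity": { "type": "IAMUser", "userName": "example-user" }
  }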
And I want each of these dictionaries from the list of dictionaries above to be a separate event in Kibana. With some Googling around, I understand that I could use the split filter to achieve this. My aws_ct_s3.conf now looks something like:
input {
  s3 {
    access_key_id => "access_key_id"
    bucket => "bucket_name_here"
    secret_access_key => "secret_access_key"
    prefix => "AWSLogs/<account_number>/CloudTrail/ap-southeast-1/2019/01/09"
    sincedb_path => "/tmp/s3ctlogs.sincedb"
    region => "us-east-2"
    codec => "json"
    add_field => { "source" => "gzfiles" }
  }
}
filter {
  split {
    field => "Records"
  }
}
output {
  stdout { codec => json }
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    index => "attack-%{+YYYY.MM.dd}"
  }
}
And with this I am in fact getting the data as I need it in Kibana.
Now the problem is:
Without the filter in place, the volume of data being shipped by Logstash from S3 to Elasticsearch was in the GBs, while after applying the filter indexing has stopped at roughly 5000 documents.
I do not know what I am doing wrong here. Could someone please help?
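For what it's worth, Logstash's monitoring API (on its default port 9600) shows whether events are still flowing into and out of the split filter; this is just a diagnostic, not part of the config above:

  # Per-pipeline event counts (in / filtered / out)
  curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'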
Current config:
java -XshowSettings:vm => Max Heap Size: 8.9 GB
elasticsearch jvm options => max and min heap size: 6GB
logstash jvm options => max and min heap size: 2GB
ES version - 6.6.0
LS version - 6.6.0
Kibana version - 6.6.0
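These heap sizes are the ones configured in each service's jvm.options (paths assume a standard package install):

  # /etc/elasticsearch/jvm.options
  -Xms6g
  -Xmx6g

  # /etc/logstash/jvm.options
  -Xms2g
  -Xmx2g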
This is what the current heap usage looks like:

Related

Logstash: Missing data after migration

I have been migrating one of the indexes in our self-hosted Elasticsearch to Amazon Elasticsearch Service using Logstash. We have around 1812 documents in our self-hosted Elasticsearch, but in Amazon Elasticsearch we have only about 637 documents. Well over half of the documents are missing after migration.
Our Logstash config file:
input {
  elasticsearch {
    hosts => ["https://staing-example.com:443"]
    user => "userName"
    password => "password"
    index => "testingindex"
    size => 100
    scroll => "1m"
  }
}
filter {
}
output {
  amazon_es {
    hosts => ["https://example.us-east-1.es.amazonaws.com:443"]
    region => "us-east-1"
    aws_access_key_id => "access_key_id"
    aws_secret_access_key => "access_key_id"
    index => "testingindex"
  }
  stdout {
    codec => rubydebug
  }
}
We have tried this for some of the other indexes as well, but it still migrates only about half of the documents.
Make sure to compare apples to apples by running GET index/_count on your index on both sides.
You might see more or fewer documents depending on where you look (Elasticsearch head plugin, Kibana, Cerebro, etc.) and whether replicas are taken into account in the count or not.
In your case you had more replicas in your local environment than in your AWS Elasticsearch service, hence the different count.
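For example, using the endpoints from your config (the AWS call may additionally need request signing, depending on the domain's access policy):

  # Source (self-hosted) cluster
  curl -s -u userName:password 'https://staing-example.com:443/testingindex/_count?pretty'

  # Destination (Amazon Elasticsearch Service) domain
  curl -s 'https://example.us-east-1.es.amazonaws.com:443/testingindex/_count?pretty'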

Duplicate field values for grok-parsed data

I have Filebeat capturing logs from a uWSGI application running in Docker. The data is sent to Logstash, which parses it and forwards it to Elasticsearch.
Here is the Logstash conf file:
input {
  beats {
    port => 5044
  }
}
filter {
  grok {
    match => { "log" => "\[pid: %{NUMBER:worker.pid}\] %{IP:request.ip} \{%{NUMBER:request.vars} vars in %{NUMBER:request.size} bytes} \[%{HTTPDATE:timestamp}] %{URIPROTO:request.method} %{URIPATH:request.endpoint}%{URIPARAM:request.params}? => generated %{NUMBER:response.size} bytes in %{NUMBER:response.time} msecs(?: via sendfile\(\))? \(HTTP/%{NUMBER:request.http_version} %{NUMBER:response.code}\) %{NUMBER:headers} headers in %{NUMBER:response.size} bytes \(%{NUMBER:worker.switches} switches on core %{NUMBER:worker.core}\)" }
  }
  date {
    # 29/Oct/2018:06:50:38 +0700
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  kv {
    source => "request.params"
    field_split => "&?"
    target => "request.query"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "test-index"
  }
}
Everything was fine, but I've noticed that all values captured by the grok pattern are duplicated. Here is how it looks in Kibana:
Note that the raw data, such as the log field that wasn't grok output, is fine. I've seen that the kv filter has an allow_duplicate_values parameter, but it doesn't apply to grok.
What is wrong with my configuration? Also, is it possible to rerun grok patterns on data already in Elasticsearch?
Maybe your Filebeat is already doing the job and creating these fields.
Did you try adding this parameter to your grok?
overwrite => [ "request.ip", "request.endpoint", ... ]
To rerun grok on already-indexed data, you need to use the elasticsearch input plugin to read the data from ES and re-index it after grok.
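A minimal sketch of that approach, reusing your existing grok filter (the pattern is elided here, and the target index name is just a placeholder):

  input {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test-index"          # read the documents that were already indexed
      docinfo => true                # expose _id/_index under [@metadata] if you want to overwrite in place
    }
  }
  filter {
    grok {
      match => { "log" => "..." }    # same pattern as in your original config
      overwrite => [ "request.ip", "request.endpoint" ]
    }
  }
  output {
    elasticsearch {
      hosts => ["http://localhost:9200"]
      index => "test-index-regrok"   # placeholder target index
    }
  }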

reading .gz files using logstash

I am trying to use Logstash 5.5 to analyze archived (.gz) files generated every minute. Each .gz file contains a CSV file in it. My .conf file looks like below:
input {
  file {
    type => "gzip"
    path => [ "C:\data*.gz" ]
    start_position => "beginning"
    sincedb_path => "gzip"
    codec => gzip_lines
  }
}
filter {
  csv {
    separator => ","
    columns => ["COL1","COL2","COL3","COL4","COL5","COL6","COL7"]
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "mydata"
    document_type => "zipdata"
  }
  stdout {}
}
Initially I was getting an error about the missing gzip_lines plugin, so I installed it. After installing the plugin, I can see that Logstash says "Successfully started Logstash API endpoint", but nothing gets indexed. I do not see any indication in the Logstash logs that data is being indexed into Elasticsearch, and when I try to find the index in Kibana, it is not available there. It means that Logstash is not putting data into Elasticsearch.
Maybe I am using the wrong configuration. Please suggest the correct way of doing this.

Reindexing in Elasticsearch 1.7

There is a problem with our mappings for elasticsearch 1.7. I am fixing the problem by creating a new index with the correct mappings. I understand that since I am creating a new index I will have to reindex from old index with existing data to the new index I have just created. Problem is I have googled around and can't find a way to reindex from old to new. Seems like the reindex API was introduced in ES 2.3 and not supported for 1.7.
My question is how do I reindex my data from old to new after fixing my mappings. Alternatively, what is the best practice for making mapping changes in ES 1.7?
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html will not work for me because we're on an old version of ES (1.7)
https://www.elastic.co/blog/changing-mapping-with-zero-downtime
I initially went down that path but got stuck; I still need a way to reindex from the old index to the new one.
Late for your use case, but wanted to put it out there for others. This is an excellent step-by-step guide on how to reindex an Elasticsearch index using Logstash version 1.5 while maintaining the integrity of the original data: http://david.pilato.fr/blog/2015/05/20/reindex-elasticsearch-with-logstash/
This is the logstash-simple.conf the author creates:
input {
  # We read from the "old" cluster
  elasticsearch {
    hosts => [ "localhost" ]
    port => "9200"
    index => "index"
    size => 500
    scroll => "5m"
    docinfo => true
  }
}
filter {
  mutate {
    remove_field => [ "@timestamp", "@version" ]
  }
}
output {
  # We write to the "new" cluster
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "new_index"
    index_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
  # We print dots to see it in action
  stdout {
    codec => "dots"
  }
}
There are a few options for you:
Use Logstash: it's very easy to create a reindex config in Logstash and use that to reindex your documents. For example:
input {
  elasticsearch {
    hosts => [ "localhost" ]
    port => "9200"
    index => "index1"
    size => 1000
    scroll => "5m"
    docinfo => true
  }
}
output {
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "index2"
    index_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
}
The problem with this approach is that it will be relatively slow, since you'll have only a single machine performing the reindexing process.
Another option is to use this tool. It will be faster than Logstash, but you'll have to provide segmentation logic for all your documents to speed up the processing. For example, if you have a numeric field whose values range from 1 to 100, you could segment the queries in the tool into, say, 10 intervals (1-10, 11-20, ..., 91-100), so the tool will spawn 10 indexers that work in parallel reindexing your old index.
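The same segmentation idea can also be approximated with Logstash itself by running several copies of the reindex config in parallel, each reading a non-overlapping slice via a range query (the field name here is hypothetical):

  input {
    elasticsearch {
      hosts => [ "localhost" ]
      port => "9200"
      index => "index1"
      # one slice of the data; run sibling configs with 11-20, 21-30, etc.
      query => '{ "query": { "range": { "my_numeric_field": { "gte": 1, "lte": 10 } } } }'
      docinfo => true
    }
  }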

logstash indexing files multiple times?

I'm using Logstash (v2.3.3-1) to index about 800k documents from S3 into AWS Elasticsearch, and some documents are being indexed 2 or 3 times instead of only once.
The files are static (nothing is updating or touching them) and they're very small (each is roughly 1.1 KB).
The process takes a very long time to run on a t2.micro (~1 day).
The config I'm using is:
input {
  s3 {
    bucket => "$BUCKETNAME"
    codec => "json"
    region => "$REGION"
    access_key_id => '$KEY'
    secret_access_key => '$KEY'
    type => 's3'
  }
}
filter {
  if [type] == "s3" {
    metrics {
      meter => "events"
      add_tag => "metric"
    }
  }
}
output {
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  } else {
    amazon_es {
      hosts => [$HOST]
      region => "$REGION"
      index => "$INDEXNAME"
      aws_access_key_id => '$KEY'
      aws_secret_access_key => '$KEY'
      document_type => "$TYPE"
    }
    stdout { codec => rubydebug }
  }
}
I've run this twice now with the same problem (into different ES indices), and the files that are being indexed more than once are different each time.
Any suggestions gratefully received!
The s3 input is very fragile. It records the time of the last file processed, so any files that share the same timestamp will not be processed, and multiple Logstash instances cannot read from the same bucket. As you've seen, it's also painfully slow at determining which files to process (a good portion of the blame goes to Amazon here).
The s3 input only works for me when I use a single Logstash instance to read the files, delete (or back up to another bucket/folder) the files after processing to keep the original bucket as empty as possible, and set sincedb_path to /dev/null.
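A rough sketch of that setup using the s3 input's backup and delete options (bucket names are placeholders):

  input {
    s3 {
      bucket => "$BUCKETNAME"
      region => "$REGION"
      codec => "json"
      backup_to_bucket => "my-processed-logs"   # hypothetical backup bucket
      backup_add_prefix => "done/"
      delete => true                            # remove processed files from the source bucket
      sincedb_path => "/dev/null"               # don't rely on the sincedb at all
    }
  }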
