I started using ELK a week ago to store multiple CSVs and get them into Kibana for easier analysis. One case will involve multiple machines, and one machine will generate many CSVs. These CSVs follow a particular naming pattern. I am taking one particular file (BrowsingHistoryView_DMZ-machine1.csv) as a reference and setting up the case as the index. To define an index I've chosen to rename the files to carry a '__case_number__' prefix, so the file name becomes __1__BrowsingHistoryView_DMZ-machine1.csv
Now I want to derive three things out of it:
1. Get the case number __1__ and use 1 as the index. 1, 2, 3 etc. will be used as case numbers.
2. Get the file type (BrowsingHistoryView, for example) and add a tag to the uploaded file.
3. Get the machine name DMZ-machine1 (I don't know yet where I'll use it).
I created a config file for it, which is shown below:
input {
file {
path => "/home/kriss/Documents/*.csv" # get the files from Documents
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
csv {
separator => ","
}
if [path] =~ "BrowsingHistory" { mutate { add_tag => ["Browsinghistory"] } # define a new tag for browser history, this worked
grok { match => ["path", "__(?<case>[0-9]+)__(?<category>\w+_)(?<machine>(.+)).csv"] # This regex pattern is to get category(browsingHistory), MachineName
}
}
if [path] =~ "Inbound_RDP_Events" { mutate { add_tag => {"Artifact" => "RDP" } } }
} # This tagging worked
output {
elasticsearch {
hosts => "localhost"
index => "%{category}" # This referencing the category variable didn't work
}
stdout {}
}
When I run this config in Logstash, the index generated is %{category}. I need it to capture browser_history as the index for that file. Also, can I convert the category to lowercase, since uppercase letters aren't allowed in index names? I tried to follow the official documentation but didn't get the complete info that I need.
There's a grok debugger in Dev Tools in Kibana you can use to work on these kinds of problems, or an online one at https://grokdebug.herokuapp.com/ - it's great.
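For example, pasting a sample path into the debugger together with the pattern used in the config below shows exactly which named captures come out (the full path here is an assumption, combining your Documents directory with the example filename):
/home/kriss/Documents/__1__BrowsingHistoryView_DMZ-machine1.csv
_+%{NUMBER:case}_+%{GREEDYDATA:category}_+%{DATA:machine}\.csv$
which should give you case = 1, category = BrowsingHistoryView and machine = DMZ-machine1.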
Below is a slightly modified version of your config. I've removed your comments and inserted my own.
The changes are:
The path regex in your config doesn't match the example filename you gave. You might want to change it back, depending on how accurate your example was.
The grok pattern has been tweaked
Changed your Artifact tag to a field, because it looks like you're trying to create a field
I tried to stick to your spacing convention :)
input {
file {
path => "/home/kriss/Documents/*.csv"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
csv {
separator => ","
}
# I replaced your regex with something that matches your example
# filename, but given that you said you already had this
# working, you might want to change it back.
if [path] =~ "browser_history" {
mutate { add_tag => ["Browsinghistory"] }
grok {
# I replaced custom captures with a more grokish style, and
# use GREEDYDATA to capture everything up to the last '_'
match => [ "path", "^_+%{NUMBER:case}_+%{GREEDYDATA:category}_+%{DATA:case}\.csv$" ]
}
}
# Replaced `add_tag` with `add_field` so that the syntax makes sense
if [path] =~ "Inbound_RDP_Events" { mutate { add_field => {"Artifact" => "RDP" } } }
# Added the `mutate` filter's `lowercase` function for "category"
mutate {
lowercase => [ "category" ]
}
}
output {
elasticsearch {
hosts => "localhost"
index => "%{category}"
}
stdout {}
}
Not tested, but I hope it gives you enough clues.
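One quick way to check what the grok filter actually extracted, before wiring up the index name, is to print each event with the rubydebug codec (a sketch, assuming the config above):
output {
stdout { codec => rubydebug }
}
Every field of every event gets printed, so you can confirm that case, category and machine are populated before switching the elasticsearch output back on.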
So, for reference, for anyone trying to use custom variables in a Logstash config file, below is the working config:
input {
file {
path => "/home/user1/Documents/__1__BrowsingHistoryView_DMZ-machine1.csv" # Getting the absolte path (necessary)
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
csv {
separator => ","
}
if [path] =~ "BrowsingHistory" { mutate { add_field => {"artifact" => "Browsinghistory"} } # if BrowsingHistory is found in path, add a tag called Browsinghistory
grok { match => ["path", "__(?<case>[0-9]+)__(?<category>\w+_)(?<machine>(.+)).csv"] # get the caseNumber, logCategory, machineName into variables
}
}
if [path] =~ "Inbound_RDP_Events" { mutate { add_field => {"artifact" => "RDP"} } } # another tag if RDP event file found in path
}
output {
elasticsearch {
hosts => "localhost"
index => "%{case}" # passing the variable value derived from regex
# index => "%{category}" # another regex variable
# index => "%{machine}" # another regex variable
}
stdout {}
}
I wasn't very sure whether to add a new tag or a new field (add_field => {"artifact" => "Browsinghistory"}) for easy identification of a file in Kibana. If someone could provide some info on how to choose between them, that would help.
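For what it's worth, the practical difference is visible in the event itself: add_tag appends a plain string to the built-in tags array, while add_field creates a named field you can filter and aggregate on. A small sketch (names taken from the configs above):
mutate { add_tag => ["Browsinghistory"] } # event gets tags: ["Browsinghistory"]
mutate { add_field => { "artifact" => "Browsinghistory" } } # event gets artifact: "Browsinghistory"
If you want to query or visualise on the value in Kibana (artifact: RDP vs artifact: Browsinghistory), a field is usually the better choice; a tag is more of a simple yes/no marker.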
Related
I am trying to generate various types in the same index based on various CSVs. As I don't know how many CSVs there will be, making an input for each one would not be viable.
So does anyone know how to generate types with the names of the files, and load each CSV into its corresponding type?
input {
file {
path => "/home/user/Documents/data/*.csv"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
csv {
separator => ","
skip_header => "true"
autodetect_column_names => true
autogenerate_column_names => true
}
}
output {
elasticsearch {
hosts => "http://localhost:9200"
index => "final_index"
}
stdout {}
}
Thank you so much
Having multiple document types in the same index has been removed from Elasticsearch since version 6. If a document does not look the way the index is templated, the data cannot be sent to it. What you can do is make sure that all fields are known and that you have one general template containing all possible fields.
Is there a reason why you want all of it in one index?
If it is for querying purposes or Kibana, know that you can use wildcards when searching and index patterns in Kibana.
Update after your comment:
Use a filter to extract the filename using grok
filter {
grok {
match => ["path","%{GREEDYDATA}/%{GREEDYDATA:filename}\.csv"]
}
}
And use the filename in your output like this:
elasticsearch {
hosts => "http://localhost:9200"
index => "final_index-%{[filename]}"
}
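One caveat when building the index name from the filename: Elasticsearch index names must be lowercase, so if the filenames can contain uppercase letters it is worth lowercasing the field first. A small sketch (field name taken from the grok above):
mutate {
lowercase => [ "filename" ]
}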
I'm attempting to use Logstash to collect sales data for multiple clients from multiple vendors.
So far I have an S3 (inbox) bucket that I can drop my files (currently CSVs) into, and according to the client code prefix on the file, the data gets pushed into an Elasticsearch index for each client. This all works nicely.
The problem I have is that I have data from multiple vendors and need a way of identifying which file is from which vendor. Adding an extra column to the CSV is not an option, so my plan was to add this to the filename, so I'd end up with a file-naming convention something like clientcode_vendorcode_reportdate.csv.
I may be missing something, but it seems that on S3 I can't get access to the filename inside my config, which seems crazy given that the prefix is clearly being read at some point. I was intending to use a grok or Ruby filter to simply split the filename on the underscore, giving me three key variables that I can use in my config, but all attempts so far have failed. I can't even seem to get the full source_path or filename as a variable.
My config so far looks something like this - I've removed the failed attempts at using grok/Ruby filters.
input {
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "ABC"
type => "ABC"
}
s3 {
access_key_id =>"MYACCESSKEYID"
bucket => "sales-inbox"
backup_to_bucket => "sales-archive"
delete => "true"
secret_access_key => "MYSECRETACCESSKEY"
region => "eu-west-1"
codec => plain
prefix => "DEF"
type => "DEF"
}
}
filter {
if [type] == "ABC" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "abc" }
}
}
else if [type] == "DEF" {
csv {
columns => ["Date","Units","ProductID","Country"]
separator => ","
}
mutate {
add_field => { "client_code" => "def" }
}
}
mutate
{
remove_field => [ "message" ]
}
}
output {
elasticsearch {
codec => json
hosts => "myelasticcluster.com:9200"
index => "sales_%{client_code}"
document_type => "sale"
}
stdout { codec => rubydebug }
}
Any guidance from those well versed in Logstash configs would be much appreciated!
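For illustration, the kind of underscore split described above could be done with grok along these lines, assuming - and this is exactly the open question - that the object key were available in some event field; the [file] field name below is purely hypothetical:
filter {
grok {
# [file] is a hypothetical field holding e.g. "clientcode_vendorcode_reportdate.csv"
match => [ "file", "^(?<client_code>[^_]+)_(?<vendor_code>[^_]+)_(?<report_date>[^_.]+)\.csv$" ]
}
}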
Basic is a float field. The mentioned index is not present in Elasticsearch. When running the config file with logstash -f, I get no exception. Yet the data that ends up in Elasticsearch shows the mapping of Basic as string. How do I rectify this? And how do I do this for multiple fields?
input {
file {
path => "/home/sagnik/work/logstash-1.4.2/bin/promosms_dec15.csv"
type => "promosms_dec15"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
grok{
match => [
"Basic", " %{NUMBER:Basic:float}"
]
}
csv {
columns => ["Generation_Date","Basic"]
separator => ","
}
ruby {
code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
}
}
output {
elasticsearch {
action => "index"
host => "localhost"
index => "promosms-%{+dd.MM.YYYY}"
workers => 1
}
}
You have two problems. First, your grok filter is listed prior to the csv filter and because filters are applied in order there won't be a "Basic" field to convert when the grok filter is applied.
Secondly, unless you explicitly allow it, grok won't overwrite existing fields. In other words,
grok{
match => [
"Basic", " %{NUMBER:Basic:float}"
]
}
will always be a no-op. Either specify overwrite => ["Basic"] or, preferably, use mutate's type conversion feature:
mutate {
convert => ["Basic", "float"]
}
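Putting both points together, a sketch of the reordered filter (untested; the columns and the ruby snippet are taken from the question):
filter {
csv {
columns => ["Generation_Date","Basic"]
separator => ","
}
# convert only after the csv filter has created the Basic field
mutate {
convert => ["Basic", "float"]
}
ruby {
code => "event['Generation_Date'] = Date.parse(event['Generation_Date']);"
}
}
For multiple fields you can pass several field/type pairs to convert, for example convert => { "Basic" => "float", "SomeOtherField" => "integer" } (SomeOtherField is just a placeholder).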
I have AWS ElasticBeanstalk instance logs on S3 bucket.
Path to Logs is:
resources/environments/logs/publish/e-3ykfgdfgmp8/i-cf216955/_var_log_nginx_rotated_access.log1417633261.gz
which translates to :
resources/environments/logs/publish/e-[random environment id]/i-[random instance id]/
The path contains multiple logs:
_var_log_eb-docker_containers_eb-current-app_rotated_application.log1417586461.gz
_var_log_eb-docker_containers_eb-current-app_rotated_application.log1417597261.gz
_var_log_rotated_docker1417579261.gz
_var_log_rotated_docker1417582862.gz
_var_log_rotated_docker-events.log1417579261.gz
_var_log_nginx_rotated_access.log1417633261.gz
Notice that there's a random number (a timestamp?) inserted by AWS in the filename before ".gz".
The problem is that I need to set variables depending on the log file name.
Here's my configuration:
input {
s3 {
debug => "true"
bucket => "elasticbeanstalk-us-east-1-something"
region => "us-east-1"
region_endpoint => "us-east-1"
credentials => ["..."]
prefix => "resources/environments/logs/publish/"
sincedb_path => "/tmp/s3.sincedb"
backup_to_dir => "/tmp/logstashed/"
tags => ["s3","elastic_beanstalk"]
type => "elastic_beanstalk"
}
}
filter {
if [type] == "elastic_beanstalk" {
grok {
match => [ "#source_path", "resources/environments/logs/publish/%{environment}/%{instance}/%{file}<unnecessary_number>.gz" ]
}
}
}
In this case I want to extract the environment, instance and file name from the path. In the file name I need to ignore that random number.
Am I doing this the right way? What would the full, correct solution for this be?
Another question: how can I specify fields for a custom log format for a particular log file from the list above?
This could be something like: (meta-code)
filter {
if [type] == "elastic_beanstalk" {
if [file_name] BEGINS WITH "application_custom_log" {
grok {
match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" ]
}
}
if [file_name] BEGINS WITH "some_other_custom_log" {
....
}
}
}
How do I test for file name pattern?
For your first question, and assuming that #source_path contains the full path, try:
match => [ "#source_path", "logs/publish/%{NOTSPACE:env}/%{NOTSPACE:instance}/%{NOTSPACE:file}%{NUMBER}%{NOTSPACE:suffix}" ]
This will create 4 Logstash fields for you:
env
instance
file
suffix
More information is available on the grok man page and you should test with the grok debugger.
To test fields in logstash, you use conditionals, e.g.
if [field] == "value"
if [field] =~ /regexp/
etc.
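Applied to your case, the filename test from the meta-code would look something like this (a sketch; [file] is the field extracted by the grok above, the prefix comes from your log listing and the message pattern from your own example):
if [file] =~ /^_var_log_eb-docker_containers_eb-current-app/ {
grok {
match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" ]
}
}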
Note that it's not always necessary to do this with grok. You can have multiple 'match' arguments, and it will (by default) stop after hitting the first one that matches. If your patterns are exclusive, this should work for you.
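For example, a single grok filter with two patterns tries them in order and, because break_on_match defaults to true, stops at the first one that matches (the first pattern is yours, the second is just a stock pattern for contrast):
grok {
match => {
"message" => [
"%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}",
"%{COMBINEDAPACHELOG}"
]
}
}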
I have the following JSON in a file:
{
"foo":"bar",
"spam" : "eggs"
},
{
"css":"ddq",
"eeqw": "fewq"
}
and the following conf file-
input {
file
{
path => "/opt/logstash-1.4.2/bin/sam.json"
type => "json"
codec => json_lines
start_position =>"beginning"
}
}
output { stdout { codec => json } }
But when I run
./logstash -f sample.conf
I don't get any output on stdout.
But when I don't specify json as the codec and give type => "core2" instead, it seems to work.
Does anyone know how I can fix it to work for the JSON type?
The other issue is that when it does write to stdout, it gives me the following output:
{"message":"{","#version":"1","#timestamp":"2015-07-15T02:02:02.653Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"\"foo\":\"bar\", ","#version":"1","#timestamp":"2015-07-15T02:02:02.654Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"\"spam\" : \"eggs\" ","#version":"1","#timestamp":"2015-07-15T02:02:02.655Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"},","#version":"1","#timestamp":"2015-07-15T02:02:02.655Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"{ ","#version":"1","#timestamp":"2015-07-15T02:02:02.655Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"\"css\":\"ddq\", ","#version":"1","#timestamp":"2015-07-15T02:02:02.656Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"\"eeqw\": \"fewq\"","#version":"1","#timestamp":"2015-07-15T02:02:02.656Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"}","#version":"1","#timestamp":"2015-07-15T02:02:02.656Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}{"message":"","#version":"1","#timestamp":"2015-07-15T02:02:02.656Z","type":"core2","host":"sjagannath","path":"/opt/logstash-1.4.2/bin/sam.json"}
I want to know how it can be parsed correctly, with the key-value pairs from my input file.
I found this and edited it to suit your purpose. The following config should do exactly what you want:
input {
file {
codec => multiline
{
pattern => "^\}"
negate => true
what => previous
}
path => ["/absoute_path/json.json"]
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
mutate {
replace => [ "message", "%{message}}" ]
gsub => [ "message","\n",""]
gsub => [ "message","},",""]
}
if [message] =~ /^{.*}$/ {
json { source => message }
}
}
I tried your given json and it results in two events. First with foo = bar and spam = eggs. Second with css = ddq and eeqw = fewq.
As far as I understand, you need to put each complete JSON document on one line if you want to use the json_lines codec:
{"foo":"bar","spam" : "eggs"}
{"css":"ddq","eeqw": "fewq"}
In your case you have a problem with the structure, since you also have a ',' between the JSON objects. That is not the easiest thing to handle. So if possible, change the source to match my example. If that is not possible, the multiline approach might help you. Check this for reference:
input json to logstash - config issues?