I have configured EFK stack with Fluent-bit on my Kubernetes cluster. I can see the logs in Kibana.
I also have deployed nginx pod, I can see the logs of this nginx pod also in Kibana. But all the log data are sent to a single field "log" as shown below.
How can I extract each field into a separate field? There is already a solution for fluentd in this question: Kibana - How to extract fields from existing Kubernetes logs
But how can I achieve the same with fluent-bit?
I tried adding another FILTER section below the default Kubernetes FILTER section, but it didn't work:
[FILTER]
    Name     parser
    Match    kube.*
    Key_Name log
    Parser   nginx
From this (https://github.com/fluent/fluent-bit/issues/723), I can see there is no grok support for fluent-bit.
Our official documentation for the Kubernetes filter includes an example of how to make your Pod suggest a parser for your data based on an annotation:
https://docs.fluentbit.io/manual/filter/kubernetes
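For example, attaching an annotation like this to the nginx Pod tells the Kubernetes filter which parser to apply (a minimal sketch; the pod name and image are placeholders, the fluentbit.io/parser annotation is the one described in that documentation):

apiVersion: v1
kind: Pod
metadata:
  name: nginx                        # placeholder name
  annotations:
    fluentbit.io/parser: nginx       # suggest the nginx parser for this pod's logs
spec:
  containers:
    - name: nginx
      image: nginx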
Look at this configmap:
https://github.com/fluent/fluent-bit-kubernetes-logging/blob/master/output/elasticsearch/fluent-bit-configmap.yaml
The nginx parser should be there:
[PARSER]
    Name        nginx
    Format      regex
    Regex       ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z
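Also make sure the file holding that [PARSER] block is actually loaded by Fluent Bit; in the linked configmap this is done with the Parsers_File key in the [SERVICE] section (a minimal sketch, the file name is an assumption):

[SERVICE]
    Flush        1
    Parsers_File parsers.conf    # file containing the [PARSER] definitions above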
I have Elasticsearch, Filebeat and Kibana running on a Windows machine. Filebeat is pointed at a proper log file and is watching that path. When I look at the data in Kibana it looks fine.
My issue is that the message field is a String.
Example of one log line:
12:58:09.9608 Trace {"message":"No more Excel rows found","level":"Trace","logType":"User","timeStamp":"2020-08-14T12:58:09.9608349+02:00","fingerprint":"226fdd2-e56a-4af4-a7ff-724a1a0fea24","windowsIdentity":"mine","machineName":"NAME-PC","processName":"name","processVersion":"1.0.0.1","jobId":"957ef018-0a14-49d2-8c95-2754479bb8dd","robotName":"NAME-PC","machineId":6,"organizationUnitId":1,"fileName":"GetTransactionData"}
So what I would like to have now is that String converted to a JSON so that it is possible to search in Kibana for example for the level field.
I already had a look at Filebeat. There I tried to enable the Logstash output, but then the data no longer arrived in Elasticsearch, and the log file was not generated in the Logstash folder either.
Then I downloaded Logstash following the install guide, but unfortunately I got this message:
C:\Users\name\Desktop\logstash-7.8.1\bin>logstash.bat
Sending Logstash logs to C:/Users/mine/Desktop/logstash-7.8.1/logs which is now configured via log4j2.properties
ERROR: Pipelines YAML file is empty. Location: C:/Users/mine/Desktop/logstash-7.8.1/config/pipelines.yml
usage:
  bin/logstash -f CONFIG_PATH [-t] [-r] [] [-w COUNT] [-l LOG]
  bin/logstash --modules MODULE_NAME [-M "MODULE_NAME.var.PLUGIN_TYPE.PLUGIN_NAME.VARIABLE_NAME=VALUE"] [-t] [-w COUNT] [-l LOG]
  bin/logstash -e CONFIG_STR [-t] [--log.level fatal|error|warn|info|debug|trace] [-w COUNT] [-l LOG]
  bin/logstash -i SHELL [--log.level fatal|error|warn|info|debug|trace]
  bin/logstash -V [--log.level fatal|error|warn|info|debug|trace]
  bin/logstash --help
[2020-08-14T15:07:51,696][ERROR][org.logstash.Logstash ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit
Edit:
I tried to use Filebeat only. Here I set:
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~
  - dissect:
      tokenizer: '"%{event_time} %{loglevel} %{json_message}"'
      field: "message"
      target_prefix: "dissect"
  - decode_json_fields:
      fields: ["json_message"]
but that gave me:
dissect_parsing_error
The tip about removing the "" in the tokenizer helped. Then I got the new dissected fields.
I simply refreshed the index and the message was gone. Nice.
But the question now is: how can I filter for something in the new field?
The message says your pipeline config is empty; it seems you have not configured any pipeline yet. Logstash can do the trick (JSON filter plugin), but Filebeat is sufficient here. If you don't want to introduce another service, this is the better option.
It has the decode_json_fields option to transform specific fields containing JSON in your event into structured fields. Here is the documentation.
For the future case where your whole event is JSON, Filebeat can parse it directly; configure json.message_key and the related json.* options.
EDIT - Added a Filebeat processors snippet as an example of dissecting the log line into three fields (event_time, loglevel, json_message). Afterwards the freshly extracted field json_message, whose value is a JSON object encoded as a string, is decoded into a JSON structure:
...
filebeat.inputs:
  - type: log
    paths:
      - path to your logfile
processors:
  - dissect:
      tokenizer: '%{event_time} %{loglevel} %{json_message}'
      field: "message"
      target_prefix: "dissect"
  - decode_json_fields:
      fields: ["dissect.json_message"]
      target: ""
  - drop_fields:
      fields: ["dissect.json_message"]
...
If you want to practice with the Filebeat processors, try to set the correct event timestamp, taken from the encoded JSON and written into @timestamp using the timestamp processor.
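A minimal sketch of that exercise (the field name assumes the dissect prefix used above, and the layout is only a guess based on the sample log line; the timestamp processor expects Go-style layouts):

processors:
  - timestamp:
      field: "dissect.event_time"    # assumption: the time token dissected above
      layouts:
        - '15:04:05.0000'            # assumed layout for a value like 12:58:09.9608
      test:
        - '12:58:09.9608'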
My td-agent config:
<match test>
  type webhdfs
  host localhost
  port 50070
  path /test/%Y%m%d_%H
  username hdfs
  output_include_tag false
  remove_prefix test
  time_format %Y-%m-%d %H:%M:%S
  output_include_time true
  format json
  localtime
  buffer_type file
  buffer_path /test/test
  buffer_chunk_limit 4m
  buffer_queue_limit 50
  flush_interval 3s
</match>
In the HDFS log file it shows up as below:
2016-02-22 16:04:15 {"login_id":123,"email":"abcd#gmail.com"}
Is there any way to embed the fluentd time field (not the client time) into the JSON data before it is stored in the file, like this:
{"time_key":"2016-02-22 16:04:15","login_id":123,"email":"abcd#gmail.com"}
I have the solution:
Use the plugin https://github.com/repeatedly/fluent-plugin-record-modifier to add the time field to the record and then push to HDFS (a sketch of the idea follows below).
:)
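For illustration, here is a minimal sketch of the same idea using fluentd's built-in record_transformer filter (swapped in here only because its placeholder variables are well documented; record-modifier can do similar record rewriting, check its README for the exact syntax). The time_key field name matches the desired output above:

<filter test>
  @type record_transformer
  enable_ruby true
  <record>
    # add the fluentd event time to the record before it reaches the webhdfs output
    time_key ${Time.at(time).strftime('%Y-%m-%d %H:%M:%S')}
  </record>
</filter>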
I am working on logging with FluentD and Graylog GELF with limited success. I want to forward a JSON file:
<source>
  @type tail
  path /var/log/suricata/eve.json
  pos_file /var/log/td-agent/suri_eve.pos # pos record
  tag ids
  format json
  # JSON time stamp: 2016-02-01T11:52:49.157072+0000
  # this timestamp is ruby's t.strftime("%Y-%m-%dT%H:%M:%S.%6N%z")
  time_format %Y-%m-%dT%H:%M:%S.%6N%z
  time_key timestamp # I show a JSON message below
</source>
<match **>
  @type graylog
  host 1.2.3.4 #(optional; default="localhost")
  port 12201 #(optional; default=9200)
  flush_interval 30
  num_threads 2
</match>
This kicks in, but produces error messages:
2016-02-01 15:30:11 +0000 [warn]: plugin/in_tail.rb:263:rescue in convert_line_to_event:
"{\"timestamp\":\"2016-02-01T15:27:09.000087+0000\",\"flow_id\":51921072,\"event_type\":\"flow\",\"src_ip\":\"10.1.1.85\",\"src_port\":59820,\"dest_ip\":\"224.0.0.252\",\"dest_port\":5355,\"proto\":\"UDP\",\"flow\":{\"pkts_toserver\":4,\"pkts_toclient\":0,\"bytes_toserver\":294,\"bytes_toclient\":0,\"start\":\"2016-02-01T15:26:30.393371+0000\",\"end\":\"2016-02-01T15:26:37.670904+0000\",\"age\":7,\"state\":\"new\",\"reason\":\"timeout\"}}"
error="invalid time format: value = 2016-02-01T15:27:09.000087+0000, error_class = ArgumentError, error = invalid strptime format - `%Y-%m-%dT%H:%M:%S.%6N%z'"
An original message looks like this:
{"timestamp":"2016-02-01T15:31:02.000699+0000","flow_id":52015920,"event_type":"flow","src_ip":"10.1.1.44","src_port":49313,"dest_ip":"224.0.0.252","dest_port":5355,"proto":"UDP","flow":{"pkts_toserver":2,"pkts_toclient":0,"bytes_toserver":128,"bytes_toclient":0,"start":"2016-02-01T15:30:31.348568+0000","end":"2016-02-01T15:30:31.759024+0000","age":0,"state":"new","reason":"timeout"}}
So I checked the Ruby docs. I am not too familiar with FluentD, but from what I know the time format expression should fit. I also tried format none, but that doesn't work either.
This is a bug/problem with (undocumented) reserved fields in Graylog2:
https://github.com/Graylog2/graylog2-server/issues/1761
If you run into a similar problem with timestamps, check the linked issue and the dev response.
I recently started attempting to use the fluentd + elasticsearch + kibana setup.
I'm currently feeding information through fluentd by having it read a log file I'm spitting out with python code.
The log consists of a list of JSON records, one per line, like so:
{"id": "1","date": "2014-02-01T09:09:59.000+09:00","protocol": "tcp","source ip": "xxxx.xxxx.xxxx.xxxx","source port": "37605","country": "CN","organization": "China Telecom jiangsu","dest ip": "xxxx.xxxx.xxxx.xxxx","dest port": "23"}
I have fluentd set up to read my field "id" and fill out "_id", as per the instructions here:
<source>
  type tail
  path /home/(usr)/bin1/fluentd.log
  tag es
  format json
  keys id, date, prot, srcip, srcport, country, org, dstip, dstport
  id_key id
  time_key date
  time_format %Y-%m-%dT%H:%M:%S.%L%:z
</source>
<match es.**>
  type elasticsearch
  logstash_format true
  flush_interval 10s # for testing
</match>
However, even after adding the above, "_id" still comes out as a randomly generated _id.
If anyone could point out to me what I'm doing wrong, I would much appreciate it.
id_key id should be inside <match es.**>, not <source>.
<source> is for the input plugin, tail in this case.
<match> is for the output plugin, elasticsearch in this case.
So the elasticsearch configuration should be set in <match>, for example as sketched below.
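A sketch of the corrected match block (the other settings are carried over from the question):

<match es.**>
  type elasticsearch
  logstash_format true
  flush_interval 10s # for testing
  id_key id
</match>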
http://docs.fluentd.org/articles/config-file
I've set up an interactive Hive session and loaded Apache weblog data into a table directly from an S3 bucket:
DROP TABLE apachelog;
CREATE EXTERNAL TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION 's3n://OperationOverkill/';
I can then successfully select from it like so:
SELECT * FROM apachelog LIMIT 5;
But counting (or anything requiring an actual map-reduce job) does not:
SELECT COUNT(host) FROM apachelog;
The error message:
Job Submission failed with exception 'java.io.IOException(cannot find dir = s3n://OperationOverkill/access_clickkiller_12-08-08.log in pathToPartitionInfo: s3n://OperationOverkill/)'
I googled and found a similar question on the AWS Support forum, but I am hoping for quicker pointers/help from SO.
I ran into the same problem, but using a subdirectory in S3 fixed it. So I would try putting your files into something like "s3n://OperationOverkill/subdir/" and pointing the table there, as sketched below.
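Assuming the log files are moved under that subdirectory, the existing table can simply be repointed (the subdir name is just a placeholder):

ALTER TABLE apachelog SET LOCATION 's3n://OperationOverkill/subdir/';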
This seems to be a bug. It may work for S3 if you do not use the bucket root, but I could not get it to work for HDFS (something like hdfs:///path/to/folder/):
https://issues.apache.org/jira/browse/HIVE-7774
Since the bug is fixed in Hive 0.14, you need to use that version or higher.
In the context of AWS:
If you are following the sample code from here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
you are probably using Hive 0.13.*; see this:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_SupportedHiveVersions.html
You can instead follow this link, which shows the new way of creating a cluster with a new AMI version:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-4.5.0/emr-release-differences.html#emr-release-label
With a "release-label" of 4 or 5 you get a newer Hive version, which should resolve the issue; see this for the EMR (AMI) to Hive version mapping:
http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-release-components.html
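For illustration, creating such a cluster with the AWS CLI might look roughly like this (a sketch: the cluster name, instance type and count are placeholders, and the release label is taken from the 4.5.0 guide linked above):

aws emr create-cluster \
  --name "hive-cluster" \
  --release-label emr-4.5.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles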