Optimizing dictionary translation in logstash - elasticsearch

I am using logstash to parse data from a csv file and push it to elasticsearch. I have a dictionary with 600k lines, which uses one of the fields as a key to map it to a string of values. I am currently using the translate plugin like this to achieve what I need
filter {
translate {
dictionary_path => "somepath"
field => "myfield"
override => false
destination => "destinationField"
}
}
I get a comma separated String in my destinationField which I read using
filter{
csv {
source => "destinationField"
columns => ["col1","col2","col3"]
separator => ","
}
}
The result of adding these 2 blocks has increased my processing time by 3x. If it used to take 1 min to process and push all my data, it is now taking 3min to complete the task.
Is this the expected behavior (It is a large dictionary)? Or is there any way to further optimize this code?

The csv filter can be expensive. I wrote a plugin logstash-filter-augment that works nearly identically to translate but handles a native CSV document better. You can use a real CSV rather than csv filter to parse a field.

Related

Some of KV filter values has custom date that identified as string in Kibana

I'm using kv filter in Logstash to process config file in the following format :
key1=val1
key2=val2
key3=2020-12-22-2150
with the following lines in Logstash :
kv {
field_split => "\r\n"
value_split => "="
source => "message"
}
Some of my fields in the conf file have a the following date format : YYYY-MM-DD-HHMMSS. When Logstash send the fields to ES, Kibana display them as strings. How can I let Logstash know that those fields are date fields and by that indexing them in ES as dates and not strings ?
I don't want to edit the mapping of the index because it will require reindexing. My final goal with those fields is to calculate the diff between the fields (in seconds, minutes,hours..) and display it in Kibana.
The idea that I have :
Iterate over k,v filter results, if the value is of format YYYY-MM-DD-HHMMSS (check with regex)
In this case, chance the value of the field to milliseconds since epoch
I decided to use k,v filter and Ruby code as a solution but I'm facing an issue.
It could be done more easily outside of logstash by adding a dynamic_template on your index and let him manage field types.
You can use the field name as a detector if it is clear enough (*_date) or define a regex
"match_pattern": "regex",
"match": "^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$"
The code above hasnot been tested.
You can find the official doc here.
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html
My solution :
I used the kv filter to convert each line into key value set.
I saved the kv filter resut into a dedicated field.
On this dedicated field, I run a Ruby script that changed all the dates with the custom format to miliseconds since epoch.
code :
filter {
if "kv_file" in [tags] {
kv {
field_split => "\r\n"
value_split => "="
source => "message"
target => "config_file"
}
ruby {
id => "kv_ruby"
code => "
require 'date'
re = /([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-[0-23]{2}[0-5]{1}[0-9]{1}[0-5]{1}[0-9]{1})/
hash = event.get('config_file').to_hash
hash.each { |key,value|
if value =~ re
date_epochs_milliseconds = DateTime.strptime(value,'%F-%H%M%S').strftime('%Q')
event.set(key, date_epochs_milliseconds.to_i)
end
}
"
}
}
}
By the way, if you are facing the following error in your Ruby compilation : (ruby filter code):6: syntax error, unexpected null hash it doesn't actually mean that you got a null value, it seems that it is related to the escape character of the double quotes. Just try to replace double quotes with one quote.

Searching or filtering on a multi value field

I have a MySQL database with a table that contains 2 importants fields title and age_range.
That table saves documents like this '45;60' for documents designed for users between 45 and 60 years old, '18;70' for users between 18 and 70 years old and so on...
Now I would like to fire the query 'test' on the field title with the filter '18;50' for the field age_range that will return all documents matching 'test' with the age range field contained in this interval including the 2 cases above for example.
For instance, I use Logstash to index my data.
How can I achieve this?
Any treatment to do while indexing my data with logstash?
Any filter, tokenizer to use while indexing using ES analyzer?
Thank you in advance
You can split the data as two fields with grok filter. To ship data to Elasticsearch, you can use logstash jdbc_streaming input and elasticsearch output firstly. And you configure your input like below:
input {
jdbc_streaming {
# Configuration of jdbc
# https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_streaming.html
}
}
filter {
# Split the field as separeted two fields
grok {
match => { "age_range" => "%{NUMBER:age_range_top};%{NUMBER:age_range_bottom}" }
}
}
output {
elasticsearch {
# elasticsearch output configuration
}
}
Analysis depends on your search method. How can you want to search these fields. Range fields is necessary default one if you want to do only range filter. But you should do some work about title. For example you can follow this example to handle autocomplete.

Logstash doc_as_upsert cross index in Elasticsearch to eliminate duplicates

I have a logstash configuration that uses the following in the output block in an attempt to mitigate duplicates.
output {
if [type] == "usage" {
elasticsearch {
hosts => ["elastic4:9204"]
index => "usage-%{+YYYY-MM-dd-HH}"
document_id => "%{[#metadata][fingerprint]}"
action => "update"
doc_as_upsert => true
}
}
}
The fingerprint is calculated from a SHA1 hash of two unique fields.
This works when logstash sees the same doc in the same index, but since the command that generates the input data doesn't have a reliable rate at which different documents appear, logstash will sometimes insert duplicates docs in a different date stamped index.
For example, the command that logstash runs to get the input generally returns the last two hours of data. However, since I can't definitively tell when a doc will appear/disappear, I tun the command every fifteen minutes.
This is fine when the duplicates occur within the same hour. However, when the hour or day date stamp rolls over, and the document still appears, elastic/logstash thinks it's a new doc.
Is there a way to make the upsert work cross index? These would all be the same type of doc, they would simply apply to every index that matches "usage-*"
A new index is an entirely new keyspace and there's no way to tell ES to not index two documents with the same ID in two different indices.
However, you could prevent this by adding an elasticsearch filter to your pipeline which would look up the document in all indices and if it finds one, it could drop the event.
Something like this would do (note that usages would be an alias spanning all usage-* indices):
filter {
elasticsearch {
hosts => ["elastic4:9204"]
index => "usages"
query => "_id:%{[#metadata][fingerprint]}"
fields => {"_id" => "other_id"}
}
# if the document was found, drop this one
if [other_id] {
drop {}
}
}

Logstash Filter for a custom message

I am trying to parse a bunch of strings in Logstash and output is set as ElasticSearch.
Sample input string is: 2016 May 24 10:20:15 User1 CREATE "Create a new folder"
The grok filter is:
match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{WORD:user} %{WORD:action_performed} %{WORD:action_description} "}
In Elasticsearch, I am not able to see separate columns for different field such as timstamp, user, action_performed etc.
Instead the whole string is under a single column "message".
I would like to store the information in separate fields instead of just a single column.
Not sure what to change in logstash filter to achieve as desired.
Thanks!
You need to change your grok pattern with this, i.e. use QUOTEDSTRING instead of WORD and it will work!
match => { "message" => "%{SYSLOGTIMESTAMP:timestamp} %{WORD:user} %{WORD:action_performed} %{QUOTEDSTRING:action_description}"}

JSON parser in logstash ignoring data?

I've been at this a while now, and I feel like the JSON filter in logstash is removing data for me. I originally followed the tutorial from https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-14-04
I've made some changes, but it's mostly the same. My grok filter looks like this:
uuid #uuid and fingerprint to avoid duplicates
{
target => "#uuid"
overwrite => true
}
fingerprint
{
key => "78787878"
concatenate_sources => true
}
grok #Get device name from the name of the log
{
match => { "source" => "%{GREEDYDATA}%{IPV4:DEVICENAME}%{GREEDYDATA}" }
}
grok #get all the other data from the log
{
match => { "message" => "%{NUMBER:unixTime}..." }
}
date #Set the unix times to proper times.
{
match => [ "unixTime","UNIX" ]
target => "TIMESTAMP"
}
grok #Split up the message if it can
{
match => { "MSG_FULL" => "%{WORD:MSG_START}%{SPACE}%{GREEDYDATA:MSG_END}" }
}
json
{
source => "MSG_END"
target => "JSON"
}
So the bit causing problems is the bottom, I think. My gork stuff should all be correct. When I run this config, I see everything in kibana displayed correctly, except for all the logs which would have JSON code in them (not all of the logs have JSON). When I run it again without the JSON filter it displays everything.
I've tried to use a IF statement so that it only runs the JSON filter if it contains JSON code, but that didn't solve anything.
However, when I added a IF statement to only run a specific JSON format (So, if MSG_START = x, y or z then MSG_END will have a different json format. In this case lets say I'm only parsing the z format), then in kibana I would see all the logs that contain x and y JSON format (not parsed though), but it won't show z. So i'm sure it must be something to do with how I'm using the JSON filter.
Also, whenever I want to test with new data I started clearing old data in elasticsearch so that if it works I know it's my logstash that's working and not just running of memory from elasticsearch. I've done this using XDELETE 'http://localhost:9200/logstash-*/'. But logstash won't make new indexes in elasticsearch unless I provide filebeat with new logs. I don't know if this is another problem or not, just thought I should mention it.
I hope that all makes sense.
EDIT: I just check the logstash.stdout file, it turns out it is parsing the json, but it's only showing things with "_jsonparsefailure" in kibana so something must be going wrong with Elastisearch. Maybe. I don't know, just brainstorming :)
SAMPLE LOGS:
1452470936.88 1448975468.00 1 7 mfd_status 000E91DCB5A2 load {"up":[38,1.66,0.40,0.13],"mem":[967364,584900,3596,116772],"cpu":[1299,812,1791,3157,480,144],"cpu_dvfs":[996,1589,792,871,396,1320],"cpu_op":[996,50]}
MSG_START is load, MSG_END is everything after in the above example, so MSG_END is valid JSON that I want to parse.
The log bellow has no JSON in it, but my logstash will try to parse everything after "Inf:" and send out a "_jsonparsefailure".
1452470931.56 1448975463.00 1 6 rc.app 02:11:03.301 Inf: NOSApp: UpdateSplashScreen not implemented on this platform
Also this is my output in logstash, since I feel like that is important now:
elasticsearch
{
hosts => ["localhost:9200"]
document_id => "%{fingerprint}"
}
stdout { codec => rubydebug }
I experienced a similar issue and found that some of my logs were using a UTC time/date stamp and others were not.
Fixed the code to use exclusively UTC and sorted the issue for me.
I asked this question: Logstash output from json parser not being sent to elasticsearch
later on, and it has more relevant information on it, maybe a better answer if anyone ever has a similar problem to me you can check out that link.

Resources