logstash: how to include input file line number - elasticsearch

I am trying to build a way to navigate my log files, and the main features I need are:
search for strings inside a log file (and return the lines where they occur);
pagination from line x to line y.
I was checking out Logstash, and it looks great for my first feature (searching), but not so much for the second one. I was under the impression that I could somehow index the file line number along with the log information of each record, but I can't seem to find a way.
Is there a Logstash filter or a Filebeat processor that can do this? I can't make it work.
I was also thinking that maybe I could have all my processes log into a database with the processed information, but that is also pretty much impossible (or very difficult), because the log handler doesn't know the current log line either.
In the end, what I could do to serve a paginated view of my log file (through a service) would be to actually open the file, navigate to a specific line, and show it; that is not very optimal, as the file could be very big, and I am already indexing it into Elasticsearch (with Logstash).
My current configuration is very simple:
Filebeat
filebeat.prospectors:
- type: log
  paths:
    - /path/of/logs/*.log
output.logstash:
  hosts: ["localhost:5044"]
Logstash
input {
  beats {
    port => "5044"
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
  }
}
Right now for example I am getting an item like:
{
  "beat": {
    "hostname": "my.local",
    "name": "my.local",
    "version": "6.2.2"
  },
  "@timestamp": "2018-02-26T04:25:16.832Z",
  "host": "my.local",
  "tags": [
    "beats_input_codec_plain_applied"
  ],
  "prospector": {
    "type": "log"
  },
  "@version": "1",
  "message": "2018-02-25 22:37:55 [mylibrary] INFO: this is an example log line",
  "source": "/path/of/logs/example.log",
  "offset": 1124
}
If I could somehow include a field like line_number: 1 in that item, it would be great, as I could then use Elasticsearch filters to navigate through the whole log.
If you have ideas for different ways to store (and navigate) my logs, please also let me know.

Are the log files generated by you, or can you change the log structure? If so, you can add a counter as a prefix and parse it out with Logstash.
For example for
12345 2018-02-25 22:37:55 [mylibrary] INFO: this is an example log line
your filter would look like this:
filter {
  grok {
    match => { "message" => "%{INT:count} %{GREEDYDATA:message}" }
    overwrite => ["message"]
  }
}
A new field "count" will be created, which you can then use for your purposes.
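A small optional addition (a sketch, not part of the answer above): grok captures values as strings, so if you want to sort or range-filter on the counter in Elasticsearch, you can convert it to an integer with a mutate filter:
filter {
  mutate {
    # the grok-captured counter is a string; convert it so it sorts numerically
    convert => { "count" => "integer" }
  }
}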

At this moment, I don't think there are any solutions here. Logstash, Beats, Kibana all have the idea of events over time and that's basically the way things are ordered. Line numbers are more of a text editor kind of functionality.
To a certain degree Kibana can show you the events in a file. It won't give you a page by page kind of list where you can actually click on a page number, but using time frames you could theoretically look at an entire file.
There are similar requests (enhancements) for Beats and Logstash.

First let me give what is probably the main reason why Filebeat doesn't already have a line number field. When Filebeat resumes reading a file (like after a restart) it does an fseek to resume from the last recorded offset. If it had to report the line numbers it would either need to store this state in its registry or re-read the file and count newlines up to the offset.
If you want to offer a service that lets you paginate through the logs backed by Elasticsearch, you can use the scroll API with a query for the file. You must sort the results by @timestamp and then by offset. Your service would use a scroll query to get the first page of results.
POST /filebeat-*/_search?scroll=1m
{
  "size": 10,
  "query": {
    "match": {
      "source": "/var/log/messages"
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "asc"
      }
    },
    {
      "offset": "asc"
    }
  ]
}
Then to get all future pages you use the scroll_id returned from the first query.
POST /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBwAAAAAAPXDOFk12OEYw="
}
This will give you all the log data for a given file name, even tracking it across rotations. If line numbers are critical, you could produce them synthetically by counting events starting from the first event that has offset == 0, but I would avoid this because it's very error prone, especially if you ever add any filtering or multiline grouping.
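For completeness, here is roughly what such a synthetic counter could look like as a Logstash ruby filter. This is only a sketch under strong assumptions: it uses the Logstash 5+ event API, keys its counters on Filebeat's source field, keeps state in memory (so it resets on restart), and only behaves predictably with a single pipeline worker:
filter {
  ruby {
    # hypothetical per-file line counter; not restart-safe and not shared across workers
    init => "@line_counters = Hash.new(0)"
    code => "
      path = event.get('source')
      @line_counters[path] += 1
      event.set('line_number', @line_counters[path])
    "
  }
}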

Related

Logstash ignore document update if document time field in log is older than current time field in document

I am using logstash to process a log file. One of the fields in my log file is of type Date and has this format: yyyyMMddHHmmssSSS
I read each line of my log file into a document in an index in Elasticsearch. A sample line from my log file looks like this:
{"location":"Earth","sku":"0000000","quantity":"5","time":"20180813124704961"}
In turn, the document structure in my index looks like this:
{
  "_source": {
    "sku": "0000000",
    "time": "20180813124704961",
    "location": "Chicago",
    "quantity": 5
  }
}
My logs are constantly updating and I want to prevent my index from having stale data. How can I check against the time field in my index to see if it is older or newer than the same line the next time the log file gets processed?
For example, if the time field in the same line of the log file were to change to be older than the time above, then the document should not be updated; but if the time field value is newer, then it should be updated.
Here is what I have tried (logstash.conf):
elasticsearch {
  hosts => "http://localhost:9200"
  index => "logstash"
  scripted_upsert => true
  script => "if(ctx.op == create || params.event.get('time').compareTo(ctx._source.time) > 0) ctx._source = params.event"
}
Many thanks in advance.
I am posting an answer for those who may also stumble across a similar problem.
elasticsearch {
  hosts => "http://localhost:9200"
  index => "logstash"
  action => "update"
  scripted_upsert => true
  script_lang => "painless"
  script_type => "inline"
  script => "if(ctx.op == 'create' || params.event.time.compareTo(ctx._source.time) > 0) ctx._source = params.event;"
}
The documentation is not super clear on this, but you can access fields from the document being indexed directly by stepping into the JSON via params.event.YOUR_FIELD.compareTo(...), and then you can do whatever you like with your data in the script.
event is the default variable name, but you reach it through params (i.e. params.event).
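For context, here is a sketch of the full output block these options would sit in. The document_id line is an assumption on my part (it uses the hypothetical sku field as the document id so that repeated log lines update the same document); adapt it to whatever uniquely identifies your records:
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "logstash"
    # assumption: sku uniquely identifies a record, so later lines update the same document
    document_id => "%{sku}"
    action => "update"
    scripted_upsert => true
    script_lang => "painless"
    script_type => "inline"
    script => "if(ctx.op == 'create' || params.event.time.compareTo(ctx._source.time) > 0) ctx._source = params.event;"
  }
}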

How to treat certain field values as null in `Elasticsearch`

I'm parsing log files which, for simplicity's sake, let's say have the following format:
{"message": "hello world", "size": 100, "forward-to": "127.0.0.1"}
I'm indexing these lines into an Elasticsearch index, where I've defined a custom mapping such that message, size, and forward-to are of type text, integer, and ip respectively. However, some log lines will look like this:
{"message": "hello world", "size": "-", "forward-to": ""}
This leads to parsing errors when Elasticsearch tries to index these documents. For technical reasons it's far from trivial for me to pre-process these documents and change "-" and "" to null. Is there any way to define which values my mapping should treat as null? Is there perhaps an analyzer I can write that works on any field type whatsoever, which I can add to all entries in my mapping?
Basically I'm looking for somewhat of the opposite of the null_value option. Instead of telling Elasticsearch what to turn a null_value into, I'd like to tell it what it should turn into a null_value. Also acceptable would be a way to tell Elasticsearch to simply ignore fields that look a certain way but still parse the other fields in the document.
So this one's easy apparently. Add the following to your mapping settings :
{
  "settings": {
    "index": {
      "mapping": {
        "ignore_malformed": "true"
      }
    }
  }
}
This will still index the document (contrary to what I had understood from the documentation...), but the malformed values will be ignored during aggregations (so if you have 3 entries in an integer field that are "1", 3, and "hello world", an averaging aggregation will yield 2).
Keep in mind that because of the way the option was implemented (and I would say this is a bug) this still fails for an object that is entered as a concrete value, and vice versa. If you'd like to get around that, you can set the field's enabled value to false, like this:
{
  "mappings": {
    "my_mapping_name": {
      "properties": {
        "my_unpredictable_field": {
          "enabled": false
        }
      }
    }
  }
}
This comes at a price though, since it means the field won't be indexed, but the values entered will still be stored, so you can still access them by searching for the document through another field. This usually shouldn't be an issue, as you likely won't be filtering documents based on the value of such an unpredictable field, but that depends on your specific use case. See here for the official discussion of this issue.
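If you would rather not relax parsing for the whole index, ignore_malformed can also be set per field in the mapping. A minimal sketch, reusing the size field from the question and the hypothetical my_mapping_name type from above:
{
  "mappings": {
    "my_mapping_name": {
      "properties": {
        "size": {
          "type": "integer",
          "ignore_malformed": true
        }
      }
    }
  }
}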

Joining logstash with parent record

I'm using Logstash to analyze my web server access logs. So far it works pretty well. I use a configuration file that produces this kind of data:
{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "en"
  },
  "object_id": "boreal:12345"
  ...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
  "type": "publication",
  "id": "boreal:12345",
  "sm_title": "The title of the publication",
  "sm_type": "thesis",
  "sm_creator": [
    "Smith, John",
    "Dupont, Albert",
    "Reegan, Ronald"
  ],
  "sm_departement": [
    "UCL/CORE - Center for Operations Research and Econometrics"
  ],
  "sm_date": "2001",
  "ss_state": "A"
  ...
}
And I would like to build a query like "give me all access for 'Smith, John' publications".
Since all my data is not in the same index, I can't use a parent-child relation (am I right?).
I read this on a forum, but it's an old post:
By limiting itself to parent/child type relationships elasticsearch makes life easier for itself: a child is always indexed in the same shard as its parent, so has_child doesn’t have to do awkward cross shard operations.
Using Logstash, I can't place all the data in a single index named logstash. Each month I have more than 1M accesses... In one year I will have more than 15M records in one index... and I need to store the web access data for a minimum of 5 years (roughly 15M * 5 = 75M records).
I don't think it's a good idea to deal with a single index containing 75M+ records (if I'm wrong, please let me know).
Does a solution to my problem exist? I can't find an elegant one.
The only approach I have at this time, in my Python script, is: a first query to collect all the ids of 'Smith, John' publications, then a loop over each publication to get all the web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES, and the response time is not acceptable (more than 7 seconds; not so bad given the number of records in ES, but not acceptable for the end user).
Thanks for your help; sorry for my English.
Renaud
An idea would be to use the elasticsearch Logstash filter in order to look up a given publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field from the publication document in the publications index with the same object_id, and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* indices.
elasticsearch {
  hosts => ["localhost:9200"]
  index => "publications"
  query => "id:%{object_id}"
  fields => {"sm_creator" => "author"}
}
As a result of this, your access log documents will look like the one below, and for "give me all access for 'Smith, John' publications" you can simply query the author field (copied from sm_creator) in all your logstash indices:
{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "en"
  },
  "object_id": "boreal:12345",
  "author": [
    "Smith, John",
    "Dupont, Albert",
    "Reegan, Ronald"
  ],
  ...
}
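With that enrichment in place, "give me all access for 'Smith, John' publications" becomes a single search across the daily indices. A minimal sketch, assuming the author field added by the filter above:
POST /logstash-*/_search
{
  "query": {
    "match": {
      "author": "Smith, John"
    }
  }
}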

Kibana visualization showing wrong results when compared to discover

I am using Kibana for visualization on top of Elasticsearch. I am trying to find the most frequently occurring terms in cleaned_keyword_phrases, which is an array of keywords. Basically, cleaned_keyword_phrases is an array of skills, e.g. ["java", "spring", "ms word"].
The results I get when searching for a query (primary_class:"job" and jobPost:"java developer") look correct in the Discover tab, but in the Visualize tab the results are wrong.
For example, when I search for java developer, these are the results displayed (these seem right) in the quick count in Discover:
Discover results: [screenshot]
Whereas when I try to visualize, the results change (these seem wrong) and are displayed as:
Visualize results: [screenshot]
In fact, on changing the query from "java developer" to developer, the quick count results in Discover change, but the results in the Visualize tab remain the same. This makes me think the query is not being applied in the Visualize tab.
I tried running the query using the Sense plugin, but there too the results come out wrong.
Query:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "primary_class:\"job\" and jobPost:\"java developer\"",
      "analyze_wildcard": true
    }
  },
  "aggs": {
    "3": {
      "terms": {
        "field": "cleaned_keyword_phrases",
        "size": 20,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}
Kibana version 4.0.2
Build 6004
Commit SHA b286116
Edit: "Good" results are results that are more related to the query, i.e. java developer in this context. So the results coming up in the quick count on the Discover tab are good, and the ones showing up in the Visualize tab seem bad, as they are not related (and they do not change when I change the query in Kibana).
I had a problem with my hostnames, similar to yours.
The visualization splits a name like vm-xx-yy into vm, xx, and yy and shows the results for that.
After setting the field from index:analyzed to index:not_analyzed, it works correctly.
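For reference, this is roughly what such a mapping looks like on the ES 1.x/2.x versions that Kibana 4 runs against (the index and type names are placeholders, and an existing index has to be reindexed for the change to take effect):
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "cleaned_keyword_phrases": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}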
Have you checked the visualization when it is attached to a dashboard with the same query string in the search bar? If the query is applied there, it may be that in the Visualize tab you are only building the visualization and the search bar query is not being applied.

Kibana 4.1 - use JSON input to create an Hour Of Day field from #timestamp for histogram

Edit: I found the answer; see below for Logstash <= 2.0.
Plugin created for Logstash 2.0:
For whoever is interested in this with Logstash 2.0 or above, I created a plugin that makes this dead simple.
The gem is here:
https://rubygems.org/gems/logstash-filter-dateparts
Here is the documentation and source code:
https://github.com/mikebski/logstash-datepart-plugin
I've got a bunch of data in Logstash with an @timestamp spanning a couple of weeks. I have a duration field that is a number field, and I can do a date histogram. I would like to do a histogram over hour of day, rather than a linear histogram from date x to date y. I would like the x axis to be 0 -> 23 instead of date x -> date y.
I think I can use the JSON Input advanced text input to add a field to the result set which is the hour of day of the @timestamp. The help text says:
"Any JSON formatted properties you add here will be merged with the elasticsearch aggregation definition for this section. For example shard_size on a terms aggregation"
which leads me to believe it can be done, but it does not give any examples.
Edited to add:
I have tried setting up an entry in the scripted fields based on the link below, but it will not work like the examples on their blog with 4.1. The following script gives an error when trying to add a field with format number and name test_day_of_week: Integer.parseInt("1234")
The problem looks like the scripting is not very robust. Oddly enough, I want to do exactly what they are doing in the examples (add fields for day of month, day of week, etc.). I can get the field to work if the script is doc['@timestamp'], but I cannot manipulate the timestamp.
The docs say Lucene expressions are allowed and show some trig and GCD examples for GIS-type stuff, but nothing for dates...
There is this update to the blog:
UPDATE: As a security precaution, starting with version 4.0.0-RC1, Kibana scripted fields default to Lucene Expressions, not Groovy, as the scripting language. Since Lucene Expressions only support operations on numerical fields, the example below dealing with date math does not work in Kibana 4.0.0-RC1+ versions.
There is no suggestion for how to actually do this now. I guess I could go off and enable the Groovy plugin...
Any ideas?
EDIT - THE SOLUTION:
I added a filter using Ruby to do this, and it was pretty simple:
Basically, in a Ruby script you can access event['field'] and you can create new fields. I use the Ruby time methods to create new fields based on the @timestamp of the event.
ruby {
  code => "
    ts = event['@timestamp']
    event['weekday'] = ts.wday
    event['hour']    = ts.hour
    event['minute']  = ts.min
    event['second']  = ts.sec
    event['mday']    = ts.day
    event['yday']    = ts.yday
    event['month']   = ts.month
  "
}
This no longer appears to work in Logstash 1.5.4: the Ruby date methods appear to be unavailable on the timestamp, so the filter throws a "rubyexception" and the fields are not added to the Logstash events.
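If you are on a newer Logstash (5.x and later), the event API changed to getter/setter methods, so a hedged equivalent of the filter above would look roughly like this (untested sketch; it assumes LogStash::Timestamp exposes the underlying Ruby Time via .time):
ruby {
  # Logstash 5+ event API: use event.get / event.set instead of the hash-style access
  code => "
    ts = event.get('@timestamp').time
    event.set('weekday', ts.wday)
    event.set('hour',    ts.hour)
    event.set('minute',  ts.min)
    event.set('second',  ts.sec)
    event.set('mday',    ts.day)
    event.set('yday',    ts.yday)
    event.set('month',   ts.month)
  "
}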
I've spent some time searching for a way to recover the functionality we had in Groovy scripted fields, which are no longer available for dynamic scripting, to provide me with fields such as "hourofday", "dayofweek", et cetera. What I've done is to add these as Groovy script files directly on the Elasticsearch nodes themselves, like so:
/etc/elasticsearch/scripts/
hourofday.groovy
dayofweek.groovy
weekofyear.groovy
... and so on.
Those script files contain a single line of Groovy, like so:
Integer.parseInt(new Date(doc["@timestamp"].value).format("d")) (dayofmonth)
Integer.parseInt(new Date(doc["@timestamp"].value).format("u")) (dayofweek)
To reference these in Kibana, first create a new search and save it, or choose one of your existing saved searches (please take a copy of the existing JSON before you change it, just in case) in the "Settings -> Saved Objects -> Searches" page. You then modify the query to add "script_fields" in, so you get something like this:
{
  "query" : {
    ...
  },
  "script_fields": {
    "minuteofhour": {
      "script_file": "minuteofhour"
    },
    "hourofday": {
      "script_file": "hourofday"
    },
    "dayofweek": {
      "script_file": "dayofweek"
    },
    "dayofmonth": {
      "script_file": "dayofmonth"
    },
    "dayofyear": {
      "script_file": "dayofyear"
    },
    "weekofmonth": {
      "script_file": "weekofmonth"
    },
    "weekofyear": {
      "script_file": "weekofyear"
    },
    "monthofyear": {
      "script_file": "monthofyear"
    }
  }
}
As shown, the "script_fields" section should sit outside the "query" itself, or you will get an error. Also ensure the script files are available on all your Elasticsearch nodes.
