Logstash: use line number of the log file as document_id - elasticsearch

I want to set the document_id of Logstash to the line number of the log file as below: (FYI, why I need to do this is shown here)
elasticsearch {
  host => yourEsHost
  cluster => "yourCluster"
  index => "logstash-%{+YYYY.MM.dd}"
  document_id => "%{lineNumber}"
}
For example, if the log file is:
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253
I want the document_id of the 3 documents to be 0, 1, and 2 respectively.
In my scenario, each Elasticsearch index is generated from a single log file, which guarantees that such a document_id will not be duplicated within an index.
Is there any way to achieve this? Thanks.

According to the answer here: https://discuss.elastic.co/t/get-line-number-of-the-log-file-line-being-processed/40960, it is not possible for now, but there is an open issue about it: https://github.com/logstash-plugins/logstash-input-file/issues/7, so it may be possible in a future version. For now, the options are modifying the file input plugin or writing your own input plugin.
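Until that is implemented, a workaround some people use (not a true file line number, and only reliable with a single pipeline worker so events keep file order) is to maintain a counter in a ruby filter. A minimal sketch, assuming Logstash 5.x (event.set) and one log file per pipeline run:
filter {
  ruby {
    # Hypothetical counter; shared state, so start Logstash with -w 1
    init => "@line_number = -1"
    code => "
      @line_number += 1
      event.set('lineNumber', @line_number)   # first event gets 0, as in the question
    "
  }
}
The %{lineNumber} reference in the elasticsearch output above would then resolve to this counter; note that it resets whenever Logstash restarts.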

Related

How can I use NiFi processor RouteOnContent

I'm trying to read a log file like that one:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
I'm sending 1,000 lines each time I run this exercise. I'm using a SplitText processor, and in the ExtractText processor I use these regexes:
successCode -> ^[0-9A-Z\-a-z\.]* - - \[[0-9A-Za-z\/\:]* -[0-9]*\] \"[A-Z]* [0-9A-Za-z\/\.\- ]*\" ([0-9]*) [0-9]*
tiemStamp -> ^[0-9A-Z\-a-z\.]* - - \[([0-9A-Za-z\/\:]*) -[0-9]*\] \"[A-Z]* [0-9A-Za-z\/\.\- ]*\" [0-9]* [0-9]*
important -> ^([0-9A-Z\-a-z\.]*) - - \[[0-9A-Za-z\/\:]* -[0-9]*\] \"[A-Z]* [0-9A-Za-z\/\.\- ]*\" [0-9]* [0-9]*
There may be a mistake in them; that is probably where my problem lies.
Then I tried to send different logs to different routes. If successCode == 200, I tried to put it on the route /user//success/%{tiemStamp}/, but all my lines go to the third route, "unmatched".
On the RouteOnContent processor I've tried:
successCode -> ${successCode:equals("200")}
successCode -> ${successCode:contains(2)}
successCode -> ${successCode:contains("2")}
Has anyone worked with "RouteOnContent" processor?
According to the documentation, the ExtractText Processor "Evaluates one or more Regular Expressions against the content of a FlowFile. The results of those Regular Expressions are assigned to FlowFile Attributes [...]"
So you should not use a RouteOnContent but a RouteOnAttribute processor in the next step.
(If you stop your RouteOnXXX processor in order to keep the messages in the queue, you can see the content of the flowfiles. On the "Attributes" tab of a flowfile, you can see the values of the different attributes. And I can confirm that with your regexes, I get successCode=200.)
Basically you can use either RouteOnAttribute or RouteOnContent, but each uses different parameters.
If you choose to use ExtractText, the properties you defined are populated for each row (after the original file is split by the SplitText processor).
Now, you have two options:
Route based on the attributes that have been extracted (RouteOnAttribute).
Route based on the content (RouteOnContent). In this case, you don't really need to use ExtractText.
Each processor routes the FlowFile differently:
RouteOnAttribute queries the attributes of the FlowFile with a NiFi Expression Language query. For example, let's say I defined the property 'name'; routing based on its value can be done as follows.
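A hypothetical dynamic property (the name isTom and the attribute name are placeholders) whose value tests that attribute could be:
isTom -> ${name:equals('Tom')}
The FlowFile is routed to the isTom relationship whenever the expression evaluates to true.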
On the other hand, RouteOnContent queries the content of the FlowFile using a regular expression. For example:
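A hypothetical dynamic property (both name and regex are placeholders, matched against each split-out line) could be:
success200 -> " 200 \d+$
With Match Requirement set to "content must contain match", lines with status code 200 are routed to the success200 relationship.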
After defining these properties, each one becomes a dynamic relationship that you can connect to downstream processors.

Using Rack to run a React application is not working as it should

My idea is to use Rack to load a web application separated into two parts. This is the structure:
my_app
|______api
| |______poros
|
|______fe
| |______build
| |______node_modules
| |______package.json
| |______public
| | |______ index.html
| | ...
| |______README.md
| |______src
| |______index.js
| ...
|______config.ru
The fe part (which stands for front-end) is a React application created with create-react-app. I've built it with
npm run build
and its optimized static build is now under the my_app/fe/build directory.
This is my my_app/config.ru:
#\ -w -p 3000
require 'emeraldfw'
use Rack::Reloader, 0
use Rack::ContentLength
run EmeraldFW::Loader.new
And this is my EmeraldFW::Loader class, which is part of a gem installed and running fine.
module EmeraldFW
  class Loader
    def call(env)
      path_info = (env['PATH_INFO'] == '/') ? '/index.html' : env['PATH_INFO']
      extension = path_info.split('/').last.split('.').last
      file = File.read("fe/build#{path_info}")
      [200, {'Content-Type' => content_type[extension]}, [file]]
    end

    def content_type
      {
        'html' => 'text/html',
        'js'   => 'text/javascript',
        'css'  => 'text/css',
        'svg'  => 'image/svg+xm',
        'ico'  => 'image/x-icon',
        'map'  => 'application/octet-stream'
      }
    end
  end
end
As you may see, this is all quite simple. I capture the request in my EmeraldFW::Loader class and transform its path_info a bit. If the request is '/', I rewrite it to '/index.html' before doing anything else. In all cases I prepend fe/build so that it loads from the static build of the React application.
When I run
rackup config.ru
and load the application at http://localhost:3000 the result is completely fine:
[2017-03-15 21:28:23] INFO WEBrick 1.3.1
[2017-03-15 21:28:23] INFO ruby 2.3.3 (2016-11-21) [x86_64-linux]
[2017-03-15 21:28:23] INFO WEBrick::HTTPServer#start: pid=11728 port=3000
::1 - - [15/Mar/2017:21:28:27 -0300] "GET / HTTP/1.1" 200 386 0.0088
::1 - - [15/Mar/2017:21:28:27 -0300] "GET /static/css/main.9a0fe4f1.css HTTP/1.1" 200 623 0.0105
::1 - - [15/Mar/2017:21:28:27 -0300] "GET /static/js/main.91328589.js HTTP/1.1" 200 153643 0.0086
::1 - - [15/Mar/2017:21:28:28 -0300] "GET /static/media/logo.5d5d9eef.svg HTTP/1.1" 200 2671 0.0036
::1 - - [15/Mar/2017:21:28:28 -0300] "GET /static/js/main.91328589.js.map HTTP/1.1" 200 1705922 0.1079
::1 - - [15/Mar/2017:21:28:28 -0300] "GET /static/css/main.9a0fe4f1.css.map HTTP/1.1" 200 105 0.0021
As you may see, all resources are loading correctly, with the correct MIME types. But my default React logo, which should be spinning on the app's front page, is not there! It is as if it wasn't loaded.
The big picture of all this is to have this Rack middleware load the React front end of the app and, at the same time, correctly redirect the API requests made with fetch from the front end to the poros (Plain Old Ruby Objects) that form the API part.
The concept is quite simple, but I just can't understand why this specific resource, the SVG logo, is not loading.
It was all a matter of a wrong MIME type, caused by a typo.
As may be clearly seen in my question, I wrote:
def content_type
  {
    'html' => 'text/html',
    'js'   => 'text/javascript',
    'css'  => 'text/css',
    'svg'  => 'image/svg+xm', # <== Here is the error!!!
    'ico'  => 'image/x-icon',
    'map'  => 'application/octet-stream'
  }
end
when the correct MIME type for SVG files is 'image/svg+xml', with an 'l' at the end.
This was making the browser ignore the file it received, and so it wouldn't display it.
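For reference, the corrected entry in the content_type hash reads:
'svg' => 'image/svg+xml'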

Elasticsearch best_compression is not working

I am parsing an Apache access log with Logstash and indexing it into an Elasticsearch index. I have also indexed the geoip and agent fields. While indexing, I observed that the Elasticsearch index size is 6.7x bigger than the actual file size (space on disk). So I just want to understand: is this the correct behavior, or am I doing something wrong here? I am using Elasticsearch 5.0, Logstash 5.0 and Kibana 5.0. I also tried best_compression, but the index takes the same disk space. Here is the complete observation with the configuration files I have tried so far.
My Observations:
Use Case 1:
Logstash Conf
Template File
Apache Log file Size : 211 MB
Total number of lines: 1,000,000
Index Size: 1.5 GB
Observation: Index is 6.7x bigger than the file size.
Use Case 2:
Logstash Conf
Template File
I found a few suggestions for compressing an Elasticsearch index, so I tried those as well:
- Disable `_all` fields
- Remove unwanted fields that were created by the `geoip` and `agent` parsing
- Enable `best_compression` [ "index.codec": "best_compression" ]
Apache Log file Size : 211 MB
Total number of lines: 1,000,000
Index Size: 1.3 GB
Observation: Index is 6.16x bigger than the file size
Log File Format:
127.0.0.1 - - [24/Nov/2016:02:03:08 -0800] "GET /wp-admin HTTP/1.0" 200 4916 "http://trujillo-carpenter.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 5.01; Trident/5.1)"
I found Logstash + Elasticsearch Storage Experiments, where they say they reduced the index size from 6.23x to 1.57x. But that is a pretty old post, and those solutions no longer work in Elasticsearch 5.0.
Some more reference I have already tried:
- Part 2.0: The true story behind Elasticsearch storage requirements
- https://github.com/elastic/elk-index-size-tests
Is there any better way to optimize the Elasticsearch index size when the only purpose is showing visualizations in Kibana?
I was facing this issue because the index settings were not being applied to the index: my index name and template name were different. After making the template pattern match the index name, the compression is applied properly.
In the example below I was using the index name apache_access_logs and the template name elk_workshop.
Sharing the corrected template and Logstash configuration.
Logstash.conf
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "apache_access_logs"
    template => "apache_sizing_2.json"
    template_name => "apache_access_logs"   # it was elk_workshop
    template_overwrite => true
  }
}
Template:
{
  "template": "apache_access_logs", /* it was elk_workshop */
  "settings": {
    "index.refresh_interval": "30s",
    "number_of_shards": 5,
    "number_of_replicas": 0
  },
  ..
}
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html#indices-templates
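To double-check that the template matched and the codec actually took effect, you can query the index settings (a sketch, assuming the index name used above) and look for "best_compression" under index.codec in the response:
curl -XGET 'http://localhost:9200/apache_access_logs/_settings?pretty'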

Create a new index in elasticsearch for each log file by date

Currently
I have completed the above task by using one log file and passes data with logstash to one index in elasticsearch :
yellow open logstash-2016.10.19 5 1 1000807 0 364.8mb 364.8mb
What I actually want to do
If i have the following logs files which are named according to Year,Month and Date
MyLog-2016-10-16.log
MyLog-2016-10-17.log
MyLog-2016-10-18.log
MyLog-2016-11-05.log
MyLog-2016-11-02.log
MyLog-2016-11-03.log
I would like to tell logstash to read by Year,Month and Date and create the following indexes :
yellow open MyLog-2016-10-16.log
yellow open MyLog-2016-10-17.log
yellow open MyLog-2016-10-18.log
yellow open MyLog-2016-11-05.log
yellow open MyLog-2016-11-02.log
yellow open MyLog-2016-11-03.log
Could I please have some guidance on how I need to go about doing this?
Thank you.
It is as simple as this:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "MyLog-%{+YYYY-MM-dd}.log"
  }
}
If the lines in the file contain datetime info, you should be using the date{} filter to set @timestamp from that value. If you do this, you can use the output format that @Renaud provided, "MyLog-%{+YYYY.MM.dd}".
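For example, a minimal date filter sketch, assuming an earlier grok has placed an Apache-style timestamp into a hypothetical logtime field:
filter {
  date {
    match => [ "logtime", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}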
If the lines don't contain the datetime info, you can use the input's path for your index name, e.g. "%{path}". To get just the basename of the path:
mutate {
  gsub => [ "path", ".*/", "" ]
}
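The elasticsearch output can then use the stripped value as the index name; a sketch, assuming the file input's default path field (note that Elasticsearch index names must be lowercase, so mixed-case file names such as MyLog-... would also need a lowercase gsub):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "%{path}"   # after the gsub above, just the file's basename
  }
}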
Won't this configuration in the output section be sufficient for your purpose?
output {
  elasticsearch {
    embedded => false
    host => localhost
    port => 9200
    protocol => http
    cluster => 'elasticsearch'
    index => "syslog-%{+YYYY.MM.dd}"
  }
}
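Note that embedded, host, port, protocol and cluster belong to the old Logstash 1.x elasticsearch output and were removed later; on Logstash 2.x and newer the equivalent would be roughly:
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "syslog-%{+YYYY.MM.dd}"
  }
}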

Substring to hash key issue?

I have a log file and need to create a hash key for each URL in the record. Each line from the record has been placed into an array and I am looping through the array assigning hash keys.
I need to get from this:
"2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"
to this:
"/logschecks/scripts/setup1.php"
I have tried using match, scan and split, but they have all failed to get me where I need to go.
My method currently looks like this:
def pathHistogram (rowsInFile)
  i = 0
  urlHash = Hash.new
  while i <= rowsInFile.length - 1
    urlKey = rowsInFile[i].scan(/<"GET ">/).last.first
    if urlHash.has_key?(urlKey) == true
      # get the number of stars already in there and add one.
      urlHash[urlKey] = urlHash[urlKey] + '*'
      i = i + 1
    else
      urlHash[urlKey] = '*'
      i = i + 1
    end
  end
end
I know that just scanning for "GET " won't complete the job, but I was trying to baby-step through it. The match and split versions that I tried were fairly epic fails, but I was likely using them incorrectly and they are long gone.
Running this script gives me an undefined method error on "first", though I have gotten other errors when I vary the way this is handled.
I should also say I am not married to using scan. If another method would work better, I would be more than happy to switch.
Any help would be greatly appreciated.
You state in a comment to the other answer that the pattern is basically "GET ... HTTP", where you are interested in the ... part. That can be extracted very easily:
line = '2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"'
line[/"GET (.*?) HTTP/, 1]
# => "/logschecks/scripts/setup1.php"
Assuming each of your input lines contains /logschecks/...:
x = "2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: \"GET /logschecks/scripts/setup1.php HTTP/1.1\", host: \"www.example.com\""
x[%r(/logscheck[/\w\.]+)] # => "/logschecks/scripts/setup1.php"
Scanning HTTP logs isn't hard, but how you go about it will vary depending on the format. In the sample you're giving it's easier than a standard log because you have some landmarks you can look for:
Search for request: " using something like:
/request: "\S+ (\S+)/i
That pattern will skip over GET, POST, HEAD or whatever method was used for the request.
log_line[/request: "\S+ (\S+)/i, 1] # => "/logschecks/scripts/setup1.php"
You might want to know that if you're mining your logs. In that case...
Search for request: " followed by the method (GET, POST, HEAD, ...) using something like:
/request: "(\S+) (\S+)/i
You'd use it like:
method, url = log_line.match(/request: "(\S+) (\S+)/i).captures # => ["GET", "/logschecks/scripts/setup1.php"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
You can also grab whatever is inside the double-quotes, then split it to get at the parts:
/request: "([^"]+)"/i
For instance:
log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
method, url, http_ver = log_line[/request: "([^"]+)"/i, 1].split # => ["GET", "/logschecks/scripts/setup1.php", "HTTP/1.1"]
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"
Or use a bit more complex pattern, using some of the modern extensions and reduce the code:
log_line = %[2010/08/23 15:25:35 [error]: (4: No such file or directory), clent: 80.154.42.54, server: localhost, request: "GET /logschecks/scripts/setup1.php HTTP/1.1", host: "www.example.com"]
/request: "(?<method>\S+) (?<url>\S+) (?<http_ver>\S+)"/i =~ log_line
method # => "GET"
url # => "/logschecks/scripts/setup1.php"
http_ver # => "HTTP/1.1"
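If it helps to fold one of these patterns back into the question's method, a hypothetical rewrite (keeping the star-counting histogram, and returning the hash, which the original never did) could look like this:
def path_histogram(rows_in_file)
  url_hash = Hash.new('')                      # default value: no stars yet
  rows_in_file.each do |row|
    url_key = row[/request: "\S+ (\S+)/i, 1]   # URL portion of the request
    next if url_key.nil?                       # skip lines without a request
    url_hash[url_key] += '*'                   # one star per occurrence
  end
  url_hash
end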
