elasticsearch: extract number from a field

I'm using elasticsearch and kibana for storing my logs.
Now what I want is to extract a number from a field and store it in a new field.
So for instance, having this:
accountExist execution time: 1046 ms
I would like to extract the number (1046) and see it in a new field in kibana.
Is it possible? How?
Thanks for the help

You'll need to do this before/during indexing.
Within Elasticsearch, you can get what you need during indexing:
1. Define a new analyzer using the Pattern Analyzer to wrap a regular expression (for your purposes, one that captures consecutive digits in the string - there is a good answer on this topic).
2. Create your new numeric field in the mapping to hold the extracted times.
3. Use copy_to to copy the log message from the input field to the new numeric field from (2), where the new analyzer will parse it.
The Analyze API can be helpful for testing purposes.
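A minimal sketch of step 1 and the Analyze API check, using the Python Elasticsearch client and a 7.x-style mapping. The index name, field names and the exact regex are assumptions, and the numeric field plus copy_to wiring from steps 2-3 is not shown here; the analyzed field below is only there to demonstrate the analyzer:

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index="logs-v2", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "digits_only": {              # custom pattern analyzer
                    "type": "pattern",
                    "pattern": "[^0-9]+"      # tokens = runs of consecutive digits
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "log_message": {"type": "text"},
            "execution_time_raw": {           # analyzed copy holding the digit tokens
                "type": "text",
                "analyzer": "digits_only"
            }
        }
    }
})

# verify what the analyzer extracts from a sample message
print(es.indices.analyze(index="logs-v2", body={
    "analyzer": "digits_only",
    "text": "accountExist execution time: 1046 ms"
}))
# -> a single token "1046"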

While not performant, if you must avoid reindexing, you can use scripted fields in Kibana.
Introduction here: https://www.elastic.co/blog/using-painless-kibana-scripted-fields
1. Enable Painless regex support by putting the following in your elasticsearch.yml:
script.painless.regex.enabled: true
2. Restart Elasticsearch.
3. Create a new scripted field in Kibana through Management -> Index Patterns -> Scripted Fields.
4. Select painless as the language and number as the type.
5. Create the actual script, for example:
def logMsg = params['_source']['log_message'];
if (logMsg == null) {
    return -10000;
}
def m = /.*accountExist execution time: ([0-9]+) ms.*$/.matcher(logMsg);
if (m.matches()) {
    return Integer.parseInt(m.group(1));
} else {
    return -10000;
}
6. Reload Kibana completely for the new field to be picked up; simply re-running a search on an open Discover page will not. (This almost made me quit trying to get this working -.-)
7. Use the script in Discover or in visualizations.
While I do understand that it is not performant to script fields for millions of log entries, my use case is a very specific log entry that is logged 10 times a day in total, and I only use the resulting fields to create a visualization or in analyses where I reduce the candidates through regular queries in advance.
It would be interesting if those fields could be calculated only in situations where you need them (or where they make sense and are computable to begin with, i.e. to make the "return -10000" unnecessary). Currently they are applied and show up for every log entry.
You can generate scripted fields inside of queries like this: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html but that seems a bit too buried under the hood to maintain easily :/
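For reference, a rough sketch of that script_fields approach with the raw Python client; the index and field names are made up, and it still requires script.painless.regex.enabled just like the scripted field above:

from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(index="my-logs", body={
    "query": {"match": {"log_message": "accountExist execution time"}},
    "script_fields": {
        "exec_time_ms": {
            "script": {
                "lang": "painless",
                "source": """
                    def m = /accountExist execution time: ([0-9]+) ms/.matcher(params['_source']['log_message']);
                    return m.find() ? Integer.parseInt(m.group(1)) : -1;
                """
            }
        }
    }
})
# each hit then carries hit['fields']['exec_time_ms'], computed only for this query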

Related

How to modify data table column values in Kibana

In Elasticsearch we store events. I have built a data table which aggregates based on event types. I have a filter which checks for event.keyword : "job-completed". I am getting the count as 1 or 0, but I want to display it as completed / in-progress.
How can I achieve this in Kibana?
The best and most effective way to do this is to add another field and populate it at ingest time.
It is the best solution in terms of performance, but it can require a fair amount of extra work.
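For example, one way to do the ingest-time variant is an ingest pipeline with a script processor that derives the status once, when the document is indexed. This is just a sketch with the Python client; the pipeline, index and field names are assumptions based on the question:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# pipeline that adds a human-readable status field from the raw event value
es.ingest.put_pipeline(id="event-status", body={
    "processors": [
        {
            "script": {
                "source": "ctx.status = 'job-completed'.equals(ctx.event) ? 'completed' : 'in progress';"
            }
        }
    ]
})

# index (or reindex) documents through the pipeline
es.index(index="events", pipeline="event-status", body={"event": "job-completed"})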
You can also use a scripted field to do this without touching your data.
Go to Stack Management > Kibana > Index Patterns and select your index.
Select the Scripted Fields tab and fill in the form.
Name : your_field
language: painless
type: string
format: string
script:
if (doc['event.keyword'].value == 'job-completed') {
    return "completed";
} else {
    return "in progress";
}
I have too little information about your real data to give you working code, so you'll have to modify it to fit your needs.
Then refresh your visualization and you can use your new field.

faster search for a substring through large document

I have a CSV file of more than 1M records written in English plus another language. I have to build a UI that takes a keyword, searches through the document, and returns the records where that keyword appears. I look for the keyword in two columns only.
Here is how I implemented it:
First, I made a Postgres database for the data stored in the CSV file. Then I made a classic website where the user can enter a keyword. This is the SQL query that I use (in Spring Boot):
SELECT * FROM table WHERE col1 LIKE %:keyword% OR col2 LIKE %:keyword%;
Right now it is working perfectly fine, but I was wondering how to make the search faster. Was using SQL instead of a classic document search the better choice?
If the document is only searched once and then thrown away, it's overhead to load it into a database. Instead you can search the file directly using java.nio.file.Files with a parallel stream, which uses multiple threads to search the file concurrently:
List<Record> result = Files.lines(Paths.get("some/path"))
        .parallel()
        .unordered()
        .map(l -> lineToRecord(l))
        .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
        .collect(Collectors.toList());
NOTE: you need to provide the lineToRecord() method and the Record class yourself.
If the document is going to be searched over and over again, then you can think about indexing the document. This means pre-processing the document to suit the search requirements. In this case the keywords are col1 and col2. An index is like a map in Java, e.g.:
Map<String, Record> col1Index
But since you have the "LIKE" semantics, this is not so easy to do: it's not as simple as splitting the string by whitespace, since the keyword could match a substring. So in this case it might be best to look for a tool to help. Typically this would be something like Solr/Lucene.
Databases can also provide similar functionality, e.g.: https://www.postgresql.org/docs/current/pgtrgm.html
For LIKE queries, you should look at a GIN index with the gin_trgm_ops operator class from the pg_trgm extension. You shouldn't need to change the query at all, just build the index on each column. Or maybe one multi-column index.
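A sketch of that pg_trgm setup with psycopg2; the column names come from the query in the question, while the table name, connection string and index names are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
    # GIN trigram indexes let LIKE '%keyword%' use an index instead of a sequential scan
    cur.execute("CREATE INDEX IF NOT EXISTS idx_col1_trgm ON mytable USING gin (col1 gin_trgm_ops);")
    cur.execute("CREATE INDEX IF NOT EXISTS idx_col2_trgm ON mytable USING gin (col2 gin_trgm_ops);")
# the original query (SELECT * FROM ... WHERE col1 LIKE %:keyword% OR col2 LIKE %:keyword%) stays unchanged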

How to overcome maxClauseCount error when using multi_match query

I have 10+ Indexes on my Elasticsearch server.
Each index has one or more fields with different kinds of analyzers: keyword, standard, ngram, etc.
For Global search I am using multi_match without specifying any explicit fields.
For querying I am using the elasticsearch-dsl library; the code is below:
from elasticsearch_dsl import Search

def search_for_index(indice, term, num_of_result=10):
    s = Search(index=indice).sort({"_score": "desc"})
    s = s[:num_of_result]
    s = s.query('multi_match', query=term, operator='and')
    response = s.execute()
    return response.to_dict()['hits']['hits']
I get very good results and the search works just fine, but sometimes someone enters a slightly longer text and I get a maxClauseCount error.
For example, the search raises an error when the search term is equal to:
term=We are working on your request and will keep you posted at the earliest.
Any other slightly longer text raises the same error.
Can you help me figure out a better approach for my global search so that I can avoid this kind of error?
First of all, this limitation is there for a reason. The more boolean clauses you have, the heavier the search becomes. Think of it as intersecting (AND) or joining (OR) the subsets of document ids for each clause. This is a very heavy operation, which is why it initially has a limit of 1024 clauses.
The general recommendation would be to try to reduce the number of fields you're searching. Maybe you have fields which contain no text data or just hold internal ids. You can exclude them from the multi_match query by specifying the fields section explicitly.
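For example, with the elasticsearch-dsl query from the question this would just mean listing the fields explicitly; the field names here are made up for illustration:

s = s.query(
    'multi_match',
    query=term,
    fields=['title^2', 'description', 'comments'],
    operator='and',
)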
If you still decide to go with the current approach and you're using Elasticsearch 5.5 or higher, you can raise the limit by adding the following line to elasticsearch.yml and restarting your instance:
indices.query.bool.max_clause_count: 250000
If you're using a pre-5 version of Elasticsearch, the setting is called index.query.bool.max_clause_count.

Elasticsearch + Logstash: How to add fields based on existing data at import time

Currently I'm importing data into Elasticsearch through Logstash, at this time by reading CSV files.
Now let's say I have two numeric fields in the CSV, age and weight.
I need to add a 3rd field on the fly, by doing some math on the age, the weight and other external data (or a function result), and I need that 3rd field to be created when importing the data.
Is there any way to do this?
What could be the best practice?
In all Logstash filter sections, you can add fields via add_field, but that's typically static data.
Math calculations need a separate plugin
As mentioned there, the ruby filter plugin would probably be your best option. Here is an example snippet for your pipeline:
filter {
  # add a calculated field, for example BMI, from height and weight
  ruby {
    code => "event.set('[data][bmi]', event.get('[data][weight]').to_f / event.get('[data][height]').to_f)"
  }
}
Alternatively, in Kibana there are scripted fields, which are meant to be visualized but cannot be queried.

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
  index: not_analyzed
  doc_values: true
  ...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
1. Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
2. Go through our database and for each document create a new field that is a list of URL parts. This field could still have doc_values: true, so it would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but cannot index it with the keyword analyzer.
You did not mention why you need doc_values, but you did mention that you currently do not search on this field.
So I guess the name of the search field does not have to be the same: you can copy the field value into another field that is used only for search ("store": false). For this new field you can use the pattern analyzer or pattern tokenizer for your use case.
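A minimal sketch of such a search-only copy, using the Python client and a 5.x+ style mapping; the index and field names and the choice to split on "/" are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index="pages-v2", body={
    "settings": {
        "analysis": {
            "analyzer": {
                "url_parts": {
                    "type": "pattern",
                    "pattern": "/"           # split the URL path into its segments
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "url": {
                "type": "keyword",           # keeps doc_values, no analysis
                "copy_to": "url_search"
            },
            "url_search": {
                "type": "text",
                "store": False,              # search-only copy, never returned on its own
                "analyzer": "url_parts"
            }
        }
    }
})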
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
1. An index with an analysed field that was set up as suggested in the other answer.
2. An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
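The exact permutation rules aren't spelled out here; a hypothetical sketch of that kind of segmentation, based on the sub-segment examples in the question:

def url_sub_segments(path):
    # '/part1/user#site/part2/part3.ext' -> ['part1', 'part1/user#site', ...,
    #  'user#site', ..., 'part2', 'part2/part3.ext', 'part3.ext']
    parts = [p for p in path.split("/") if p]
    subs = []
    for i in range(len(parts)):
        for j in range(i + 1, len(parts) + 1):
            subs.append("/".join(parts[i:j]))
    return subs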
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: the analysed field showed a spike, although only a small one, on all Elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off: you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate to it.
