Search with completion suggester and German analyzer - Elasticsearch

I created a simple index with a suggest field of type completion and indexed some city names. For the suggest field I use the German analyzer.
PUT city_de
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "german"
        },
        "suggest": {
          "type": "completion",
          "analyzer": "german"
        }
      }
    }
  }
}
The analyzer works fine and searching with umlauts works well. The autocompletion is also perfect. But I ran into an issue when searching for the term wie.
Let's say I have two documents, Wiesbaden and Wien, each with the same name as its suggest completion term.
If I search for wie, I expect the cities Wien and Wiesbaden in the response. But unfortunately I get no response at all. I suppose wie is restricted somehow by the German analyzer, because if I search for wi or wies I get valid responses.
The same happens for the terms was, er, sie and und, which look like common German stopwords.
Do I need any additional configuration to also get results when I search for wie or was?
Thanks!

The problem
Searching city names by prefix
"wie" should find "Wien" or "Wiesbaden"
Possible solution approach
For this use case I would suggest using an edge n-gram tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html) and ASCII folding the terms (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html).
Example
wien
token      position  start offset  end offset
w          0         0             1
wi         1         0             2
wie        2         0             3
wien       3         0             4
wiesbaden
token      position  start offset  end offset
w          0         0             1
wi         1         0             2
wie        2         0             3
wies       3         0             4
...
wiesbaden  8         0             9
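These token lists can be reproduced with the _analyze API and a transient edge_ngram tokenizer (a quick sketch; the min_gram/max_gram sizes and token_chars here are illustrative assumptions):
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 10,
    "token_chars": [ "letter" ]
  },
  "text": "wiesbaden"
}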
Keep in mind that the system has to work in an asymmetric way now: the query should not be broken into n-grams (e.g. use the keyword analyzer), but the data in the index has to be analyzed.
There are two ways to achieve this:
1.) Specify the search analyzer on the query itself
2.) Bind the search analyzer to the field
"cities": {
"type": "text",
"fields": {
"autocomplete": {
"type": "text",
"analyzer": "autocomplete_analyzer", <-- index time analyzer
"search_analyzer": "autocomplete_search" <-- search time analyzer
}
}
}
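For completeness, here is a minimal sketch of what the two analyzers referenced above could look like in the index settings. The names autocomplete_analyzer and autocomplete_search match the mapping; the gram sizes and the choice of the standard tokenizer are assumptions, not a verified configuration:
PUT city_de
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_edge": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "autocomplete_edge" ]
        },
        "autocomplete_search": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}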
Why does the German analyzer not work?
The analyzer is designed for German text: it removes common German stopwords and uses a simple stemming algorithm to strip inflection and morphology.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#german-analyzer
Here is an example of the typical terms generated by this analyzer:
Hallo hier ist der Text über Wiesbaden und Wien. Es scheint angebracht über Wände und Wandern zu sprechen.
token       position  start offset  end offset
hallo       0         0             5
text        4         19            23
wiesbad     6         29            38
wien        8         43            47
scheint     10        52            59
angebracht  11        60            70
wand        13        76            81
wandern     15        86            93
sprech
The stopwords (hier, ist, der, über, und, es, zu) are dropped entirely and the remaining terms are stemmed. If this works on city names, that is just a coincidence.
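To confirm the stopword explanation, you can run the built-in german analyzer through the _analyze API; the token list comes back empty because wie is on the German stopword list, which is why the suggester has nothing to match:
POST _analyze
{
  "analyzer": "german",
  "text": "wie"
}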

Related

Find sequences in time series data using Elasticsearch

I'm trying to find example Elasticsearch queries for returning sequences of events in a time series. My dataset is rainfall values at 10-minute intervals, and I want to find all storm events. A storm event would be considered continuous rainfall for more than 12 hours, which equates to 72 consecutive records with a rainfall value greater than zero. I could do this in code, but to do so I'd have to page through thousands of records, so I'm hoping for a query-based solution. A sample document is below.
I'm working in a University research group, so any solutions that involve premium tier licences are probably out due to budget.
Thanks!
{
  "_index": "rabt-rainfall-2021.03.11",
  "_type": "_doc",
  "_id": "fS0EIngBfhLe-LSTQn4-",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2021-03-11T16:00:07.637Z",
    "current-rain-total": 8.13,
    "rain-duration-in-mins": 10,
    "last-recorded-time": "2021-03-11 15:54:59",
    "rain-last-10-mins": 0,
    "type": "rainfall",
    "rain-rate-average": 0,
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      "2021-03-11T16:00:07.637Z"
    ]
  },
  "sort": [
    1615478407637
  ]
}
Update 1
Thanks to @Val, my current query is:
GET /rabt-rainfall-*/_eql/search
{
  "timestamp_field": "@timestamp",
  "event_category_field": "type",
  "size": 100,
  "query": """
    sequence
      [ rainfall where "rain-last-10-mins" > 0 ]
      [ rainfall where "rain-last-10-mins" > 0 ]
    until [ rainfall where "rain-last-10-mins" == 0 ]
  """
}
Having a sequence query with only one rule causes a syntax error, hence the duplicate. The query as it is runs but doesn't return any documents.
Update 2
Results weren't being returned due to me not escaping the property names correctly. However, due to the two sequence rules I'm getting matches of length 2, not of arbitrary length until the stop clause is met.
GET /rabt-rainfall-*/_eql/search
{
  "timestamp_field": "@timestamp",
  "event_category_field": "type",
  "size": 100,
  "query": """
    sequence
      [ rainfall where `rain-last-10-mins` > 0 ]
      [ rainfall where `rain-last-10-mins` > 0 ]
    until [ rainfall where `rain-last-10-mins` == 0 ]
  """
}
This would definitely be a job for EQL, which allows you to return sequences of related data (ordered in time and matching some constraints):
GET /rabt-rainfall-2021.03.11/_eql/search?filter_path=-hits.events
{
  "timestamp_field": "@timestamp",
  "event_category_field": "type",
  "size": 100,
  "query": """
    sequence with maxspan=12h
      [ rainfall where `rain-last-10-mins` > 0 ]
    until [ rainfall where `rain-last-10-mins` == 0 ]
  """
}
What the above query seeks to do is basically this:
get me the sequence of events of type rainfall
with rain-last-10-mins > 0
happening within a 12h window
up until rain-last-10-mins drops to 0
The until statement makes sure that the sequence "expires" as soon as an event has rain-last-10-mins: 0 within the given time window.
In the response, you're going to get the number of matching events in hits.total.value and if that number is 72 (because the time window is limited to 12h), then you know you have a matching sequence.
So your "storm" signal here is to detect whether the above query returns hits.total.value: 72 or lower.
Disclaimer: I haven't tested this, but in theory it should work the way I described.
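Since only the count matters for the "storm" signal, the same query can be trimmed down to just that number with filter_path (a standard REST parameter; this is identical to the query above apart from the tighter filter):
GET /rabt-rainfall-2021.03.11/_eql/search?filter_path=hits.total
{
  "timestamp_field": "@timestamp",
  "event_category_field": "type",
  "size": 100,
  "query": """
    sequence with maxspan=12h
      [ rainfall where `rain-last-10-mins` > 0 ]
    until [ rainfall where `rain-last-10-mins` == 0 ]
  """
}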

Elasticsearch count overlapping timeranges in date histogram

I have events stored in Elasticsearch 6.6 that have a start and end time e.g.:
{
  "startTime": "2019-01-11T14:49:16.719Z",
  "endTime": "2019-01-11T16:31:56.483Z"
}
I want to display a date histogram which shows the number of overlapping events in each hour.
Example:
Hour of Day:
12   13   14   15   16   17   18   19
Events:
     <====E1====>        <===E2==>
               <===E3====>
               <==E4==>
Result:
0    1    1    3    2    2    1    0
Is there a way to do this with an elasticsearch aggregation or do I have to implement it in the application?
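One aggregation-only approach, though it requires a newer cluster than 6.6: since Elasticsearch 7.4 the date_histogram aggregation accepts range fields, and a range is counted in every bucket it overlaps, which is exactly the overlap count above. A minimal sketch (the events index name and period field are illustrative assumptions):
PUT events
{
  "mappings": {
    "properties": {
      "period": { "type": "date_range" }
    }
  }
}

PUT events/_doc/1
{
  "period": {
    "gte": "2019-01-11T14:49:16.719Z",
    "lte": "2019-01-11T16:31:56.483Z"
  }
}

POST events/_search
{
  "size": 0,
  "aggs": {
    "concurrent_per_hour": {
      "date_histogram": {
        "field": "period",
        "calendar_interval": "hour"
      }
    }
  }
}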

Spring Boot Actuator 'http.server.requests' metric MAX time

I have a Spring Boot application and I am using Spring Boot Actuator and Micrometer in order to track metrics about my application. I am specifically concerned about the 'http.server.requests' metric and the MAX statistic:
{
  "name": "http.server.requests",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 2
    },
    {
      "statistic": "TOTAL_TIME",
      "value": 0.079653001
    },
    {
      "statistic": "MAX",
      "value": 0.032696019
    }
  ],
  "availableTags": [
    {
      "tag": "exception",
      "values": [
        "None"
      ]
    },
    {
      "tag": "method",
      "values": [
        "GET"
      ]
    },
    {
      "tag": "status",
      "values": [
        "200",
        "400"
      ]
    }
  ]
}
I suppose the MAX statistic is the maximum execution time of a request (since I have made two requests, it's the time of the longer of the two).
Whenever I filter the metric by any tag, like localhost:9090/actuator/metrics/http.server.requests?tag=status:200
{
  "name": "http.server.requests",
  "measurements": [
    {
      "statistic": "COUNT",
      "value": 1
    },
    {
      "statistic": "TOTAL_TIME",
      "value": 0.029653001
    },
    {
      "statistic": "MAX",
      "value": 0.0
    }
  ],
  "availableTags": [
    {
      "tag": "exception",
      "values": [
        "None"
      ]
    },
    {
      "tag": "method",
      "values": [
        "GET"
      ]
    }
  ]
}
I am always getting 0.0 as the max time. What is the reason for this?
What does MAX represent (MAX Discussion)
MAX represents the maximum time taken to execute the endpoint.
Analysis for /user/asset/getAllAssets
COUNT  TOTAL_TIME  MAX
5      115         17
6      122         17   (execution time = 122 - 115 = 7)
7      131         17   (execution time = 131 - 122 = 9)
8      187         56   (execution time = 187 - 131 = 56; from now on MAX will be 56)
9      204         56   (execution time = 204 - 187 = 17)
Will MAX be 0 if we have a small number of requests (or one request) to the particular endpoint?
No, the number of requests for a particular endpoint does not affect MAX (see the image from Spring Boot Admin).
When will MAX be 0?
There is a timer which sets the value to 0: when the endpoint has not been called or executed for some time, the timer sets MAX back to 0. Here the approximate timer value is 2 to 2.5 minutes (120 to 150 seconds).
DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)), which sets the measurement to 0 if no request has been made in the last 2 minutes (120 seconds).
Methods such as public TimeWindowMax(Clock clock, ...) and private void rotate(), together with the Clock interface, have been written for this. You may see the implementation here.
How did I determine the timer value?
I took 6 samples (executed the same endpoint 6 times) and measured the time difference between the moment of calling the endpoint and the moment MAX was set back to zero.
The MAX property belongs to the enum Statistic, which is used by Measurement (in a Measurement we get COUNT, TOTAL_TIME and MAX):
public static final Statistic MAX
The maximum amount recorded. When this represents a time, it is reported in the monitoring system's base unit of time.
Notes:
This is the case for the metric of a particular endpoint (here /actuator/metrics/http.server.requests?tag=uri:/user/asset/getAllAssets).
For the general metric /actuator/metrics/http.server.requests, MAX for some endpoint will be set back to 0 by the same timer. In my view, MAX for /http.server.requests behaves the same as for a particular endpoint.
UPDATE
The documentation has been updated for MAX:
NOTE: Max for basic DistributionSummary implementations such as CumulativeDistributionSummary, StepDistributionSummary is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If the time window ends, it'll be reset to 0 and a new time window starts again. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly.
You can see the individual metrics by using ?tag=uri:{endpoint_tag} as defined in the response of the root /actuator/metrics/http.server.requests call. The details of the measurement values are:
COUNT: Rate per second for calls.
TOTAL_TIME: The sum of the times recorded. Reported in the monitoring system's base unit of time.
MAX: The maximum amount recorded. When this represents a time, it is reported in the monitoring system's base unit of time.
The discrepancy you are seeing is due to the presence of a timer, meaning that after some time the currently recorded MAX value for any tagged metric can be reset back to 0. Can you make some new calls to your endpoint and then immediately call /actuator/metrics/http.server.requests to see a non-zero MAX value for the given tag?
This is due to the idea of capturing MAX per smaller time window: instead of a single value over a long period, you effectively get a series of MAX values, one per window.
You can see this in action in the Micrometer source code. There is a rotate() method responsible for resetting the MAX value, creating the behaviour described above. It is called on every poll(), which is triggered periodically for metric gathering.
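If the default 2-minute expiry is too short for your scrape interval, Micrometer lets you override it with a MeterFilter. A minimal sketch, assuming a Spring Boot app where this bean is picked up by the auto-configured registry (the 10-minute value is an arbitrary example, not a recommendation):
import java.time.Duration;

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {

    // Keep the TimeWindowMax window at 10 minutes instead of the default 2,
    // so MAX is not reset to 0 between scrapes.
    @Bean
    public MeterFilter httpServerRequestsExpiry() {
        return new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
                if (id.getName().startsWith("http.server.requests")) {
                    return DistributionStatisticConfig.builder()
                            .expiry(Duration.ofMinutes(10)) // assumption: tune to your scrape interval
                            .build()
                            .merge(config); // fall back to existing settings for everything else
                }
                return config;
            }
        };
    }
}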

Logstash: how to use syslog_pri plugin

I'm using the Elastic stack. There are a lot of messages which are parsed by my Logstash, and I've decided to add some additional rules to it.
I've installed the syslog_pri plugin in my Logstash, because I want to create a mapping for my syslog severity levels.
All of my messages have syslog_pri values according to RFC 3164, where error messages have syslog_pri values of (3, 11, 19, ..., 187).
Well, I have two problems:
1) It's not very convenient, because querying via Kibana gets unwieldy. When I want to filter errors, it looks like:
syslog_pri: (3 OR 11 OR 19 OR 27 OR 35 OR 43 OR 51 OR 59 OR 67 OR 75 OR 83 OR 91 OR 99 OR 107 OR 115 OR 123 OR 131 OR 139 OR 147 OR 155 OR 163 OR 171 OR 179 OR 187)
but it would be much easier with the syslog_pri plugin. I expect to have something like this:
syslog_pri: "error"
Is it possible to create this mapping somehow?
2) I want to change this syslog_pri value for some specific messages.
For example, I'm catching a message like "Hello world" and want to change the severity from 14 (info message) to 11 (error message).
I'm doing something like this:
filter {
  grok {
    match => { "message" => "..." }
  }
  syslog_pri { }
  if "Hello world" in [message] {
    mutate { syslog_pri => 11 }
  }
}
But this failed with an error:
logstash.filters.mutate - Unknown setting 'syslog_pri' for mutate
Suggestions?
To use the syslog_pri filter, you simply need to have a field with the value, which in turn will be decoded by the filter. If you have a field which is already named syslog_pri, then using it is as simple as putting
syslog_pri { }
in your logstash configuration.
This plugin will create 4 additional fields which will contain the decoded syslog_pri information:
syslog_facility
syslog_severity
syslog_facility_code
syslog_severity_code
As for mutating a field, the syntax is as follows:
mutate {
  replace => { "syslog_pri" => "11" }
}
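Putting both pieces together, here is a sketch of a full filter section. Note the ordering assumption: the mutate runs before syslog_pri { }, so the decoded severity fields reflect the overridden value (the grok pattern stays elided as in the question):
filter {
  grok {
    match => { "message" => "..." }
  }

  # Override the priority first, so the decode below picks it up.
  if "Hello world" in [message] {
    mutate {
      replace => { "syslog_pri" => "11" }
    }
  }

  # Decodes syslog_pri into syslog_facility, syslog_severity,
  # syslog_facility_code and syslog_severity_code; in Kibana you can
  # then filter on syslog_severity instead of numeric codes.
  syslog_pri { }
}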

ElasticSearch - Slop use case

I'm pretty new to Elasticsearch, and I'm trying to use the slop attribute to solve a problem.
The problem:
I have a multi_match query, and one of the fields is full of codes, like:
Example 1: "AE 102 BR, V 415 A, K45863"
Example 2: "AE 100 BR, AE 101 BR, AE 103 BR, AE 104 BR"
The problem is that sometimes Example 2 will be chosen for the query "AE 102 BR", because ES finds "AE" and "BR" multiple times.
The solution I want is to boost "close matching": by this I mean that one match of 3 consecutive words should always be more relevant than four matches of 2 consecutive words.
What I've tried:
"multi_match": {
"fields": [ "field1^10", "field2^3",
"field3^3", "**field_code**^3", "global_compatibilities^1" ]
, "query": "AE 102 BR",
"slop": 1
}
but it doesn't work (the slop doesn't change anything in the score).
Can someone explain to me how to use slop in my case?
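One detail that may explain the behaviour: slop is a phrase-matching parameter, so in a multi_match it only takes effect when the query type is phrase (or phrase_prefix); with the default best_fields type it is ignored. A sketch of the phrase variant, with the field names and boosts copied from the question:
"multi_match": {
  "type": "phrase",
  "query": "AE 102 BR",
  "slop": 1,
  "fields": [ "field1^10", "field2^3", "field3^3", "field_code^3", "global_compatibilities^1" ]
}
A common pattern is to combine a phrase clause like this with the original best_fields query inside a bool should, so the phrase match acts as a relevance booster rather than a strict filter.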
