How do I not match a bare hyphen in Elasticsearch?

I am querying Apache logs stored in Elasticsearch. I want to return log entries for a given hostname that contains a hyphen, and only those with a populated auth field.
These strings should be an exact match: "hostname": "example-dev" and not "auth": "-".
My questions are:
How do I correctly remap a type in Elasticsearch to allow a hyphen to be part of the matched string?
How do I correctly query a type in Elasticsearch for a bare hyphen?
The hyphen is a reserved character in Elasticsearch, so I understand it takes special effort. However, I'm having what seems like a lot of trouble figuring out how to include it in my query.
I have tried to remap the type to be not_analyzed. It looks like the format has recently changed. The old way of defining the index ("analyzed", "not_analyzed", and "no") makes sense to me. The new way (true or false) does not. In either case, I cannot seem to get remapping to work.
Here is my attempt at remapping:
DELETE /search
PUT search
{
"mappings" : {
"beat" : {
"properties" : {
"hostname" : {
"type" : "text",
"norms" : false,
"index" : false
}
}
}
}
}
I have not included the remapping of the auth field because it only returns a mapper_parsing_exception.
I am using JSON to query Elasticsearch. Here is my query:
GET _search
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
{
"match": {
"beat.hostname": "example-dev"
}
}
],
"must_not": [
{
"match": {
"auth.keyword": "-"
}
}
]
}
}
}
}
}
I have tried escaping the hyphen with \\- but that returns results that match "auth": "-". The hostname still does not match exactly. The hostname query also matches something like "example-prod".
I have tried using "term" rather than "match"; that returns no results.
I can match a specific string for "auth", for example "must": { "match": { "auth": "foo" } } returns all entries for auth = "foo". That is opposite of what I need, but it does work. The hostname is still not exactly matched if it includes a hyphen.
The log entries are parsed into Elasticsearch using ELK stack, however this will be a report that is generated outside of Kibana for legacy reasons.
I have read the documentation and examples, but there is a lot to dig through. Many of the examples I have found are for older versions of Elasticsearch, which is understandable, but confusing.
I am new to Elasticsearch. It feels like I am just overlooking something, but the problem might stem from a basic misunderstanding of how Elasticsearch does things.

After spending some more time with Elasticsearch queries, I think I have it figured out.
Splitting the hostname string into two separate strings and matching on both filters the hostname as expected. Using an empty string for the negative match also seems to work as expected.
Here is the updated query:
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
{
"match": {
"beat.hostname": "example"
}
},
{
"match": {
"beat.hostname": "dev"
}
}
],
"must_not": [
{
"match_phrase": {
"auth.keyword": ""
}
}
]
}
}
}
}
}
I will need to do a bit more testing to make sure this actually returns what I need.
I was trying too hard to make ElasticSearch fit what I expected. Instead of working with ElasticSearch, I was trying to fight against it.
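For comparison, here is a rough Python sketch of why the original match query may have behaved this way, together with an exact-match alternative. The rough_standard_tokens helper is only an approximation of the standard analyzer (lowercase, split on non-word characters), not Lucene's actual implementation. The beat.hostname.keyword sub-field is an assumption: it exists only if the index was created with the default dynamic string mapping or an equivalent template. Both function names are hypothetical.

```python
import re


def rough_standard_tokens(text):
    """Approximate the standard analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def exact_hostname_with_auth(hostname):
    """Build a query matching the hostname exactly and excluding auth == "-".

    Assumes both fields have an un-analyzed `.keyword` sub-field.
    """
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"beat.hostname.keyword": hostname}}],
                "must_not": [{"term": {"auth.keyword": "-"}}],
            }
        }
    }


# "example-dev" and "example-prod" share the token "example";
# a bare hyphen produces no tokens at all under this approximation.
print(rough_standard_tokens("example-dev"))
print(rough_standard_tokens("example-prod"))
print(rough_standard_tokens("-"))
```

Under this approximation the two hostnames overlap on one token, which is consistent with the match query returning both. A term query against a keyword field compares the whole stored string instead, so no splitting is needed.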

Related

Can 'exists' be used to detect empty strings in ElasticSearch?

I thought this would be simple, but it is turning out to be quite complicated.
We want to be able to extract from our ElasticSearch instance empty and not empty fields. Strings cause the problem. My definitions of empty or not empty are:
Empty
It does not exist.
It does exist but the value is NULL or an empty string (for strings).
Not empty
It does exist.
It has a value that is not NULL or empty string (for strings).
And I have read about different ways to proceed, and all of them seem to involve a bit of complexity. The old missing filter, using a script portion on the query to compare with length 0, using term, etc. Implementing a should_not to mimic the logic described before does not seem to work either in my version.
Ideally, it would be fantastic if we could use the exists operator everywhere, as it could be used with all the types we have, date, integers, strings, etc.
There is something that I was assuming but that does not seem to be true at least in my case (using ElasticSearch 5.5.0):
"Elasticsearch does not index empty strings"
My understanding is that if this was true, we could use exists on that string field too. The queries are generated automatically by a module we wrote, so a simpler query would also simplify the coding of the new functionality. The same operator would be used in all cases.
I have tried to add keywords as a plain field:
...
:field {:type "keyword"}
...
And also nested:
{:type "text"
:analyzer "standard"
:fields {:raw {:type "keyword"}}}
But nothing seems to work:
{
"query": {
"bool": {
"must_not": [
{
"exists" : { "field.raw" : "x" }
}
...
...
],
All empty strings are detected as if they existed. Is there any change that we could implement to enable that?
An empty string such as "" counts as an existing value, so exists matches it. To identify whether the field is empty as per your definition, you can use the query below:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must_not": [
{
"exists": {
"field": "someField"
}
}
]
}
},
{
"term": {
"someField": ""
}
}
]
}
}
}
Replace someField in the above query with the name of the actual field in your index.
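Since the queries in the question are generated by a module, here is a minimal Python sketch of helpers that build both the "empty" query above and its negation. The function names are hypothetical, and the sketch assumes the field is a keyword field (or a keyword sub-field such as field.raw), so that term compares against the un-analyzed value.

```python
def empty_query(field):
    """Match documents where `field` is missing, null, or an empty string."""
    return {
        "bool": {
            "should": [
                {"bool": {"must_not": [{"exists": {"field": field}}]}},
                {"term": {field: ""}},
            ],
            # Explicit so the clause behaves the same when nested
            # next to other must/filter clauses.
            "minimum_should_match": 1,
        }
    }


def not_empty_query(field):
    """Match documents where `field` exists and is not an empty string."""
    return {
        "bool": {
            "must": [{"exists": {"field": field}}],
            "must_not": [{"term": {field: ""}}],
        }
    }


request_body = {"query": empty_query("someField")}
```

Keeping the two definitions in one place means the generating module can treat "empty" and "not empty" as a single operator pair for every field type, even though strings need the extra term clause.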
It's also OK to use query_string:
"query_string": { "query": "someField:\"\"" }

Elasticsearch index field with wildcard and search for it

I have a document with a field "serial number". That serial number is ABC.XXX.DEF where XXX indicates wildcards. XXX can be \d{3}[a-zA-Z0-9].
So users can search for:
ABC.123.DEF
ABC.234.DEF
ABC.XYZ.DEF
while the document only includes
ABC.XXX.DEF
When a user queries ABC.123.DEF, I need a hit on the document containing ABC.XXX.DEF. Other documents might contain ABC.DEF.XXX and must not be hit, and I am running out of ideas with my basic Elasticsearch knowledge.
Do I have to attack the problem from the query side or when analyzing/tokenizing the pattern?
Can anyone give me an example how to approach that problem?
As long as the serial number is well defined, the first solution that comes to my mind is to split it into three parts ("part1", "part2" and "part3", for example) and index them as three separate fields. Parts consisting of wildcards should either get a special value or not be indexed at all. Then at query time I would split the serial number provided by the user in the same way. Assuming that parts consisting of wildcards are not indexed, my query would look like this:
"query": {
"bool": {
"must":[
{
"bool": {
"should": [
{
"match": {
"part1": "ABC"
}
},
{
"bool": {
"must_not": {
"exists": {
"field": "part1"
}
}
}
}
]
}
},
... // Similar code for other parts
]
}
}
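The splitting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the serial number always has exactly three dot-separated parts, the literal XXX marks a wildcard part, and the helper names (split_serial, indexed_parts, serial_query) are hypothetical.

```python
WILDCARD = "XXX"  # assumed marker for a wildcard part


def split_serial(serial):
    """Split 'ABC.123.DEF' into part1/part2/part3."""
    parts = serial.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated parts, got {serial!r}")
    return {f"part{i}": value for i, value in enumerate(parts, start=1)}


def indexed_parts(pattern):
    """Fields to index for a stored pattern; wildcard parts are left out."""
    return {k: v for k, v in split_serial(pattern).items() if v != WILDCARD}


def part_clause(field, value):
    """Accept the exact value, or a document where this part is not indexed."""
    return {
        "bool": {
            "should": [
                {"match": {field: value}},
                {"bool": {"must_not": {"exists": {"field": field}}}},
            ]
        }
    }


def serial_query(serial):
    """Build the full query for a concrete serial number typed by a user."""
    clauses = [part_clause(f, v) for f, v in split_serial(serial).items()]
    return {"query": {"bool": {"must": clauses}}}
```

With this layout, a document stored as ABC.XXX.DEF has no part2 field, so the second clause accepts any middle part, while a document stored as ABC.DEF.XXX fails on part2 for the query ABC.123.DEF.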

Elastic search wildcard query crashes cluster

I run the query below on a large Elasticsearch cluster. The cluster becomes unresponsive.
{
"size": 10000,
"query": {
"bool": {
"must": [
{
"regexp": {
"message": {
"value": ".*exception.*"
}
}
},
{
"bool": {
"should": [
{
"term": {
"beat.hostname": "ip-xxx-xx-xx-xx"
}
}
]
}
},
{
"range": {
"@timestamp": {
"lt": 1518459660000,
"format": "epoch_millis",
"gte": 1518459600000
}
}
}
]
}
}
}
When I remove the wildcarded .*exception.* and replace it with any non wildcarded string like xyz it returns fast. Though the query uses a wildcarded expression, it also looks for a small time range and a specific host. I would think this is a very simple query. Any reason why elasticsearch server can't handle this query? The cluster has 10 nodes and 20 TB of data.
See the documentation for Regexp Query. It clearly states the following:
Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow
The ideal fix is to change the text analysis on the message field to use a WordDelimiterTokenFilter with split_on_case_change set to true. Then something like NullPointerException will get indexed as three separate tokens [Null, Pointer, Exception]. This lets you search on exception without using a regex. The caveat is that you need to reindex all your documents.
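To illustrate the effect of splitting on case changes, here is a rough Python approximation. It is not the Lucene word delimiter implementation, only a regex that mimics the lowercase-to-uppercase break for simple CamelCase identifiers. The function names are hypothetical, and the lowercasing step assumes a lowercase filter also sits in the analysis chain.

```python
import re


def split_on_case_change(token):
    """Split at every lowercase-to-uppercase boundary, e.g. 'fooBar' -> ['foo', 'Bar']."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def searchable_terms(message):
    """Whitespace-tokenize, split on case changes, then lowercase each piece."""
    terms = []
    for word in message.split():
        terms.extend(part.lower() for part in split_on_case_change(word))
    return terms


print(split_on_case_change("NullPointerException"))
print("exception" in searchable_terms("NullPointerException thrown"))
```

Once the word is indexed as separate terms, a plain term or match lookup for exception can use the inverted index directly instead of scanning every term against a pattern.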
Another quick thing to try might be to keep your filter conditions on the hostname and timestamp in a filter context, which will prefilter documents before running your regexp query. This may be a short-term solution for you until you fix the text analysis.
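A sketch of what that restructured request could look like, written as a Python function that builds the request body. The function name is hypothetical, the timestamp field name is an assumption passed in as a parameter, and whether this actually reduces the load depends on how the cluster executes the regexp clause.

```python
def exception_query(hostname, start_ms, end_ms,
                    timestamp_field="@timestamp", size=10000):
    """Put the hostname and time range in filter context; keep regexp in must."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Filter clauses do not contribute to scoring and can be cached.
                "filter": [
                    {"term": {"beat.hostname": hostname}},
                    {
                        "range": {
                            timestamp_field: {
                                "gte": start_ms,
                                "lt": end_ms,
                                "format": "epoch_millis",
                            }
                        }
                    },
                ],
                "must": [
                    {"regexp": {"message": {"value": ".*exception.*"}}}
                ],
            }
        },
    }


body = exception_query("ip-xxx-xx-xx-xx", 1518459600000, 1518459660000)
```

Lowering size from 10000 while testing is also a cheap way to limit how much data each shard has to return.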

How to use multifield search in elasticsearch combining should and must clause

This may be a repeated question, but I'm not finding a good solution.
I'm trying to search elasticsearch in order to get documents that contains:
- "event":"myevent1"
- "event":"myevent2"
- "event":"myevent3"
A single document does not have to contain all of them, but the result should contain only documents that have one of those event types.
This part is simple, because the Elasticsearch should clause returns exactly what I want.
But I also want every document to satisfy another condition: the field result.example.example must equal 200. This must hold for every single document, in addition to the document having one of the events described above.
So, for example, one document has "event":"myevent1" and result.example.example = 200, another has "event":"myevent2" and result.example.example = 200, and so on.
I've tried this configuration:
{
"query": {
"bool": {
"must":{"match":{"operation.result.http_status":200}},
"should": [
{
"match": {
"event": "bank.account.patch"
}
},
{
"match": {
"event": "bank.account.add"
}
},
{
"match": {
"event": "bank.user.patch"
}
}
]
}
}
}
But it is not working, because I also get documents that do not match any of the should clauses.
Hope I explained well,
Thanks in advance!
As is, your query tells ES to look for documents that must have "operation.result.http_status":200 and to boost those that have a matching event type.
You're looking to combine two must queries:
one that matches one of your event types,
one for your other condition.
The event clause accepts multiple values, and those values are exact matches: you're looking for a terms query.
Try
{
"query": {
"bool": {
"must": [
{"match":{"operation.result.http_status":200}},
{
"terms" : {
"event" : [
"bank.account.patch",
"bank.account.add",
"bank.user.patch"
]
}
}
]
}
}
}
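For a side-by-side view, here is a minimal Python sketch that builds the terms version above and one alternative that keeps the original should clauses but requires at least one of them through minimum_should_match. The function names are hypothetical, and the second variant is offered only as an option to test, not as the answer's recommendation.

```python
EVENTS = ["bank.account.patch", "bank.account.add", "bank.user.patch"]


def with_terms(status, events):
    """Both conditions in must; terms matches any one of the listed values."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"operation.result.http_status": status}},
                    {"terms": {"event": events}},
                ]
            }
        }
    }


def with_minimum_should_match(status, events):
    """Keep should clauses, but require at least one of them to match."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"operation.result.http_status": status}}],
                "should": [{"match": {"event": e}} for e in events],
                "minimum_should_match": 1,
            }
        }
    }


terms_body = with_terms(200, EVENTS)
should_body = with_minimum_should_match(200, EVENTS)
```

Without minimum_should_match, should clauses that sit next to a must clause only influence scoring, which is why the original query returned documents with other event types.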

Case insensitivity does not work

I can't figure out why my searches are case sensitive. Everything I've read says that ES is case insensitive by default. I have mappings that specify the standard analyzer for indexing and search, but some things still seem to be case sensitive, for example wildcard:
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "Rae*"
}
}
}
]
}
}
This fails but "rae*" works as wanted. I need to use wildcard for 'starts-with' type searches (I presume).
I'm using NEST from a .Net app and am specifying the analyzers when I create the index thus:
var settings = new IndexSettings();
settings.NumberOfReplicas = _configuration.Replicas;
settings.NumberOfShards = _configuration.Shards;
settings.Add("index.refresh_interval", "10s");
settings.Analysis.Analyzers.Add(new KeyValuePair<string, AnalyzerBase>("keyword", new KeywordAnalyzer()));
settings.Analysis.Analyzers.Add(new KeyValuePair<string, AnalyzerBase>("simple", new SimpleAnalyzer()));
In this case it's using the simple analyzer but the standard one has the same result.
The mapping looks like this:
name: {
type: string
analyzer: simple
store: yes
}
Anyone got any ideas what's wrong here?
Thanks
From the documentation,
"[The wildcard query] matches documents that have fields matching a wildcard expression (not analyzed)".
Because the search term is not analyzed, you'll essentially need to run the analysis yourself before generating the search query. In this case, this just means that your search term needs to be lowercase. Alternatively, you could use query_string:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "name:Rae*"
}
}
]
}
}
}
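The first option, running the analysis yourself, can be sketched in Python as below. It assumes the index-time analyzer only lowercases single-word values (as the simple and standard analyzers do for plain words) and that the user input contains no wildcard characters of its own. The function name is hypothetical.

```python
def prefix_wildcard_query(field, user_input):
    """Build a starts-with wildcard query with the term lowercased client-side."""
    term = user_input.strip().lower()
    return {
        "query": {
            "bool": {
                "must": [
                    {"wildcard": {field: {"value": term + "*"}}}
                ]
            }
        }
    }


body = prefix_wildcard_query("name", "Rae")
```

This keeps the wildcard query from the question and only changes the term it is given, so "Rae" and "rae" both end up searching for rae*.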
