Can 'exists' be used to detect empty strings in ElasticSearch? - elasticsearch

I thought this would be simple, but it is turning out to be quite complicated.
We want to be able to extract from our ElasticSearch instance empty and not empty fields. Strings cause the problem. My definitions of empty or not empty are:
Empty
It does not exist.
It does exist but the value is NULL or an empty string (for strings).
Not empty
It does exist.
It has a value that is not NULL or empty string (for strings).
And I have read about different ways to proceed, and all of them seem to involve a bit of complexity. The old missing filter, using a script portion on the query to compare with length 0, using term, etc. Implementing a should_not to mimic the logic described before does not seem to work either in my version.
Ideally, it would be fantastic if we could use the exists operator everywhere, as it could be used with all the types we have, date, integers, strings, etc.
There is something that I was assuming but that does not seem to be true at least in my case (using ElasticSearch 5.5.0):
"Elasticsearch does not index empty strings"
My understanding is that if this was true, we could use exists on that string field too. The queries are generated automatically by a module we wrote, so a simpler query would also simplify the coding of the new functionality. The same operator would be used in all cases.
I have tried to add keywords as a plain field:
...
:field {:type "keyword"}
...
And also nested:
{:type "text"
:analyzer "standard"
:fields {:raw {:type "keyword"}}}
But nothing seems to work:
{
"query": {
"bool": {
"must_not": [
{
"exists" : { "field.raw" : "x" }
}
...
...
],
All empty strings are detected as if they existed. Is there any change that we could implement to enable that?.

Empty string such as "" is considered as field exists. To identify if the field is empty as per your definition you can use the query as below:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must_not": [
{
"exists": {
"field": "someField"
}
}
]
}
},
{
"term": {
"someField": ""
}
}
]
}
}
}
Replace someField in above query by the name of the actual field in your index.

It's also ok to use query_string:
"query_string": { "query": "someField":\"\"" }

Related

Elasticsearch index field with wildcard and search for it

I have a document with a field "serial number". That serial number is ABC.XXX.DEF where XXX indicates wildcards. XXX can be \d{3}[a-zA-Z0-9].
So users can search for:
ABC.123.DEF
ABC.234.DEF
ABC.XYZ.DEF
while the document only includes
ABC.XXX.DEF
When a user queries ABC.123.DEF i need a hit on that document containing ABC.XXX.DEF. As other documents might contain ABC.DEF.XXX and must not be hit I am running out of ideas with my basic elasticsearch knowledge.
Do I have to attack the problem from the query side or when analyzing/tokenizing the pattern?
Can anyone give me an example how to approach that problem?
As long as serial number is well defined the first solution that comes to my mind is to split serial number into three parts ("part1", "part2" and "part3", for example) and index them as three separate fields. Parts consisting of wildcards should have special value or may not be indexed at all. Then at query time I would split serial number provided by user in the same way. Assuming that parts consisting of wildcards are not indexed my query would look like this:
"query": {
"bool": {
"must":[
{
"bool": {
"should": [
{
"match": {
"part1": "ABC"
}
},
{
"bool": {
"must_not": {
"exists": {
"field": "part1"
}
}
}
}
]
}
},
... // Similar code for other parts
]
}
}

How do I not match a bare hyphen in Elasticsearch?

I am querying apache logs stored in Elasticsearch. I want to return log entries from a given hostname that has a hyphen and with a populated auth field.
These strings should be an exact match: "hostname": "example-dev" and not "auth": "-".
My questions are:
How do I correctly remap a type in Elasticsearch to allow a hyphen to be part of the matched string.
How do I correctly query a type in Elasticsearch with a bare hyphen.
The hyphen is a reserved character in Elasticsearch, so I understand it takes special effort. However, I'm having what seems like a lot of trouble figuring out how to include it in my query.
I have tried to remap the type to be not_analysed. It looks like the format has recently changed. The old way of defining the index ("analysed", "not_analysed", and "no") makes sense to me. The new way (true or false) does not. In either case, I cannot seem to get remapping to work.
Here is my attempt at remapping:
DELETE /search
PUT search
{
"mappings" : {
"beat" : {
"properties" : {
"hostname" : {
"type" : "text",
"norms" : false,
"index" : false
}
}
}
}
}
I have not included the remapping of the auth field because it only returns a mapper_parsing_exception.
I am using json to query Elasticsearch. Here is my query:
GET _search
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
{
"match": {
"beat.hostname": "example-dev"
}
}
],
"must_not": [
{
"match": {
"auth.keyword": "-"
}
}
]
}
}
}
}
}
I have tried escaping the hyphen with \\- but that returns results that match "auth": "-". The hostname still does not match exactly. The hostname query also matches something like "example-prod".
I have tried using "term" rather than "match"; that returns no results.
I can match a specific string for "auth", for example "must": { "match": { "auth": "foo" } } returns all entries for auth = "foo". That is opposite of what I need, but it does work. The hostname is still not exactly matched if it includes a hyphen.
The log entries are parsed into Elasticsearch using ELK stack, however this will be a report that is generated outside of Kibana for legacy reasons.
I have read the documentation and examples, but there is a lot to dig through. Many of the examples I have found are for older versions of Elasticsearch, which is understandable, but confusing.
I am new to Elasticsearch. It feels like I am just overlooking something, but it the problem might stem from a basic misunderstanding of how Elasticsearch is doing things.
After spending some more time with ElascticSearch queries, I think I have it figured out.
Splitting the hostname string into two separate string and matching for both filters the hostname as expected. Using an empty string for the negative match also seems to work as expected.
Here is the updated query:
{
"query": {
"bool": {
"filter": {
"bool": {
"must": [
{
"match": {
"beat.hostname": "example"
}
},
{
"match": {
"beat.hostname": "dev"
}
}
],
"must_not": [
{
"match_phrase": {
"auth.keyword": ""
}
}
]
}
}
}
}
I will do bit more testing is need to make sure this is actually returning what I need.
I was trying too hard to make ElasticSearch fit what I expected. Instead of working with ElasticSearch, I was trying to fight against it.

How to use multifield search in elasticsearch combining should and must clause

This may be a repeted question but I'm not findin' a good solution.
I'm trying to search elasticsearch in order to get documents that contains:
- "event":"myevent1"
- "event":"myevent2"
- "event":"myevent3"
the documents must not contain all of them in the same document but the result should contain only documents that are only with those types of events.
And this is simple because elasticsearch helps me with the clause should
which returns exactly what i want.
But then, I want that all the documents must contain another condition that is I want the field result.example.example = 200 and this must be in every single document PLUS the document should be 1 of the previously described "event".
So, for example, a document has "event":"myevent1" and result.example.example = 200 another one has "event":"myevent2" and result.example.example = 200 etc etc.
I've tried this configuration:
{
"query": {
"bool": {
"must":{"match":{"operation.result.http_status":200}},
"should": [
{
"match": {
"event": "bank.account.patch"
}
},
{
"match": {
"event": "bank.account.add"
}
},
{
"match": {
"event": "bank.user.patch"
}
}
]
}
}
}
but is not working 'cause I also get documents that not contain 1 of the should field.
Hope I explained well,
Thanks in advance!
As is, your query tells ES to look for documents that must have "operation.result.http_status":200 and to boost those that have a matching event type.
You're looking to combine two must queries
one that matches one of your event types,
one for your other condition
The event clause accepts multiple values and those values are exact matches : you're looking for a terms query.
Try
{
"query": {
"bool": {
"must": [
{"match":{"operation.result.http_status":200}},
{
"terms" : {
"event" : [
"bank.account.patch",
"bank.account.add",
"bank.user.patch"
]
}
}
]
}
}
}

Constructing a NEST/ElasticSearch query with nested properties

I'm querying an ElasticSearch database (the Danish CVR registry) using NEST in C#. I'm trying to formulate a query that will query this scheme:
relations: [
{
participant: {
key: 123123
},
organisations: [
{
organisationName: {
name: "some string",
period: {
from: "SOME DATE"
to: "SOMEDATE OR NULL"
}
},
... more of similar objects ..
}
]
},
.. more of similar objects ..
]
My problem here is that I need to find documents that have a certain participant.key value, while at the same time has a specific organisations.organisationName.name and a missing or null value in organisations.organisationName.period.to
I know I need to use a nested query to get documents that have both a null value in the to field and a certain name in the name field, but on top of that I need to also have the specific key in the particiant.key field, and this is where I'm having trouble. Note that all 3 fields that I'm checking must be within the same relations object, and the to and name fields must be within the same organisationName object.
The query without the key part as a JSON query is this:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "relations.organisations.organisationName",
"score_mode": "max",
"query": {
"bool": {
"must": [
{ "match": { "relations.organisations.organisationName.name": "EJERREGISTER" }},
{"filtered": { "filter" : {
"missing" : { "field" : "relations.organisations.organisationName.period.to" }
} } }
]
}}}}
]
}}}
Hoping someone out there is apt at making these queries in the NEST Query DSL. I could also work from a pure ElasticSearch JSON query, but the .NET equivalent would be my preferred option :)
Thanks in advance!
After some experimentation I came to the conclusion that the right answer to my problem would be a query with a nested query that 1. Checks the key, and 2. has a nested query that does the other things I needed in organisation.organisationName object.
I couldn't quite verify this, however, because the database I'm querying does not have the relations-object marked as nested (and I can't change that since it's a government database)
My workaround was to retrieve all relations related to my keys, and then filtering out the remaining objects in memory, as this wasn't too much overhead in my scenario.
Edit: as a follow up, the external database I was using added the nested clause, and it worked as explained above.

Filter items which array contains any of given values

I have a set of documents like
{
tags:['a','b','c']
// ... a bunch properties
}
As stated in the title: Is there a way to filter all documents containing any of given tags using Nest ?
For instance, the record above would match ['c','d']
Or should I build multiple "OR"s manually ?
elasticsearch 2.0.1:
There's also terms query which should save you some work. Here example from docs:
{
"terms" : {
"tags" : [ "blue", "pill" ],
"minimum_should_match" : 1
}
}
Under hood it constructs boolean should. So it's basically the same thing as above but shorter.
There's also a corresponding terms filter.
So to summarize your query could look like this:
{
"filtered": {
"query": {
"match": { "title": "hello world" }
},
"filter": {
"terms": {
"tags": ["c", "d"]
}
}
}
}
With greater number of tags this could make quite a difference in length.
Edit: The bitset stuff below is maybe an interesting read, but the answer itself is a bit dated. Some of this functionality is changing around in 2.x. Also Slawek points out in another answer that the terms query is an easy way to DRY up the search in this case. Refactored at the end for current best practices. —nz
You'll probably want a Bool Query (or more likely Filter alongside another query), with a should clause.
The bool query has three main properties: must, should, and must_not. Each of these accepts another query, or array of queries. The clause names are fairly self-explanatory; in your case, the should clause may specify a list filters, a match against any one of which will return the document you're looking for.
From the docs:
In a boolean query with no must clauses, one or more should clauses must match a document. The minimum number of should clauses to match can be set using the minimum_should_match parameter.
Here's an example of what that Bool query might look like in isolation:
{
"bool": {
"should": [
{ "term": { "tag": "c" }},
{ "term": { "tag": "d" }}
]
}
}
And here's another example of that Bool query as a filter within a more general-purpose Filtered Query:
{
"filtered": {
"query": {
"match": { "title": "hello world" }
},
"filter": {
"bool": {
"should": [
{ "term": { "tag": "c" }},
{ "term": { "tag": "d" }}
]
}
}
}
}
Whether you use Bool as a query (e.g., to influence the score of matches), or as a filter (e.g., to reduce the hits that are then being scored or post-filtered) is subjective, depending on your requirements.
It is generally preferable to use Bool in favor of an Or Filter, unless you have a reason to use And/Or/Not (such reasons do exist). The Elasticsearch blog has more information about the different implementations of each, and good examples of when you might prefer Bool over And/Or/Not, and vice-versa.
Elasticsearch blog: All About Elasticsearch Filter Bitsets
Update with a refactored query...
Now, with all of that out of the way, the terms query is a DRYer version of all of the above. It does the right thing with respect to the type of query under the hood, it behaves the same as the bool + should using the minimum_should_match options, and overall is a bit more terse.
Here's that last query refactored a bit:
{
"filtered": {
"query": {
"match": { "title": "hello world" }
},
"filter": {
"terms": {
"tag": [ "c", "d" ],
"minimum_should_match": 1
}
}
}
}
Whilst this an old question, I ran into this problem myself recently and some of the answers here are now deprecated (as the comments point out). So for the benefit of others who may have stumbled here:
A term query can be used to find the exact term specified in the reverse index:
{
"query": {
"term" : { "tags" : "a" }
}
From the documenation https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
Alternatively you can use a terms query, which will match all documents with any of the items specified in the given array:
{
"query": {
"terms" : { "tags" : ["a", "c"]}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
One gotcha to be aware of (which caught me out) - how you define the document also makes a difference. If the field you're searching in has been indexed as a text type then Elasticsearch will perform a full text search (i.e using an analyzed string).
If you've indexed the field as a keyword then a keyword search using a 'non-analyzed' string is performed. This can have a massive practical impact as Analyzed strings are pre-processed (lowercased, punctuation dropped etc.) See (https://www.elastic.co/guide/en/elasticsearch/guide/master/term-vs-full-text.html)
To avoid these issues, the string field has split into two new types: text, which should be used for full-text search, and keyword, which should be used for keyword search. (https://www.elastic.co/blog/strings-are-dead-long-live-strings)
For those looking at this in 2020, you may notice that accepted answer is deprecated in 2020, but there is a similar approach available using terms_set and minimum_should_match_script combination.
Please see the detailed answer here in the SO thread
You should use Terms Query
{
"query" : {
"terms" : {
"tags" : ["c", "d"]
}
}
}

Resources