How do && and || work when constructing queries in NEST?

According to http://nest.azurewebsites.net/concepts/writing-queries.html, the && and || operators can be used to combine two queries with the NEST library when communicating with Elasticsearch.
I have the following query set up:
var ssnQuery = Query<NameOnRecordDTO>.Match(
    q => q.OnField(f => f.SocialSecurityNumber).QueryString(nameOnRecord.SocialSecurityNumber).Fuzziness(0)
);
which is then combined with a Bool query as shown below:
var result = client.Search<NameOnRecordDTO>(
    body => body.Query(
        query => query.Bool(
            bq => bq.Should(
                q => q.Match(
                    p => p.OnField(f => f.Name.First)
                          .QueryString(nameOnRecord.Name.First).Fuzziness(fuzziness)
                ),
                q => q.Match(
                    p => p.OnField(f => f.Name.Last)
                          .QueryString(nameOnRecord.Name.Last).Fuzziness(fuzziness)
                )
            ).MinimumNumberShouldMatch(2)
        ) || ssnQuery
    )
);
What I think this query means is that if the SocialSecurityNumber matches, or both the Name.First and Name.Last fields match, then the record should be included in the results.
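In raw query DSL, that intended logic would look roughly like this (a sketch: the fuzziness parameters are omitted, and the example values used later in the question are filled in):

{
    "bool": {
        "should": [
            {
                "bool": {
                    "should": [
                        { "match": { "name.first": "ryan" } },
                        { "match": { "name.last": "smith" } }
                    ],
                    "minimum_should_match": 2
                }
            },
            { "match": { "socialSecurityNumber": "123456789" } }
        ]
    }
}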
When I execute this query with the following data for the nameOnRecord object used in the calls to QueryString:
"socialSecurityNumber":"123456789",
"name" : {
"first":"ryan",
}
the results are the person with SSN 123456789, along with anyone with first name ryan.
If I remove the || ssnQuery from the query above, I get everyone whose first name is 'ryan'.
With the || ssnQuery in place and the following query:
{
    "socialSecurityNumber": "123456789",
    "name": {
        "first": "ryan",
        "last": "smith"
    }
}
I appear to get the person with SSN 123456789 along with people whose first name is 'ryan' or last name is 'smith'.
So it does not appear that adding || ssnQuery is having the effect that I expected, and I don't know why.
Here is the mapping for the object in question:
"nameonrecord" : {
"properties": {
"name": {
"properties": {
"name.first": {
"type": "string"
},
"name.last": {
"type": "string"
}
}
},
"address" : {
"properties": {
"address.address1": {
"type": "string",
"index_analyzer": "address",
"search_analyzer": "address"
},
"address.address2": {
"type": "string",
"analyzer": "address"
},
"address.city" : {
"type": "string",
"analyzer": "standard"
},
"address.state" : {
"type": "string",
"analyzer": "standard"
},
"address.zip" : {
"type" : "string",
"analyzer": "standard"
}
}
},
"otherName": {
"type": "string"
},
"socialSecurityNumber" : {
"type": "string"
},
"contactInfo" : {
"properties": {
"contactInfo.phone": {
"type": "string"
},
"contactInfo.email": {
"type": "string"
}
}
}
}
}
I don't think the definition of the address analyzer matters here, since the address fields are not used in the query, but I can include it if someone wants to see it.

This was in fact a bug in NEST.
First, a precursor on how NEST translates boolean queries:
NEST allows you to use operator overloading to create verbose bool queries/filters easily, i.e.:
term && term will result in:
bool
    must
        term
        term
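Rendered as actual query DSL, that structure looks like this (a sketch with hypothetical fields and values):

{
    "bool": {
        "must": [
            { "term": { "field1": "value1" } },
            { "term": { "field2": "value2" } }
        ]
    }
}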
A naive implementation of this would rewrite
term && term && term
to:
bool
    must
        term
        bool
            must
                term
                term
As you can imagine, this becomes unwieldy quite fast as a query grows more complex. NEST can spot these nested bools and join them together to become:
bool
    must
        term
        term
        term
Likewise, term && term && term && !term simply becomes:
bool
    must
        term
        term
        term
    must_not
        term
Now, if in the previous example you pass in a boolean query directly, like so:
bool(must=term, term, term) && !term
it would still generate the same query. NEST will also do the same with shoulds when it sees that the boolean descriptors in play consist ONLY of should clauses. This is because the bool query does not quite follow the same boolean logic you expect from a programming language.
To summarize the latter:
term || term || term
becomes:
bool
    should
        term
        term
        term
but
term1 && (term2 || term3 || term4)
will NOT become:
bool
    must
        term1
    should
        term2
        term3
        term4
This is because as soon as a bool query has a must clause, the should clauses stop being required and act only as boosting factors. So in the previous form you could get back results that contain ONLY term1; this is clearly not what you want in the strict boolean sense of the input.
NEST therefore rewrites this query to:
bool
    must
        term1
        bool
            should
                term2
                term3
                term4
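In raw query DSL, that rewrite corresponds to something like this (again a sketch with hypothetical fields and values):

{
    "bool": {
        "must": [
            { "term": { "field1": "term1" } },
            {
                "bool": {
                    "should": [
                        { "term": { "field2": "term2" } },
                        { "term": { "field3": "term3" } },
                        { "term": { "field4": "term4" } }
                    ]
                }
            }
        ]
    }
}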
Now, here is where the bug came into play. In your situation you have this:
bool(should=term1, term2, minimum_should_match=2) || term3
NEST identified that both sides of the OR operation consist only of should clauses and joined them together, which gives a different meaning to the minimum_should_match parameter of the first bool query.
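Concretely, the buggy merge produced a single flattened bool along these lines (a sketch with hypothetical terms); note that minimum_should_match: 2 now also governs term3, so a document matching only term3 can no longer satisfy the query:

{
    "bool": {
        "should": [
            { "term": { "field1": "term1" } },
            { "term": { "field2": "term2" } },
            { "term": { "field3": "term3" } }
        ],
        "minimum_should_match": 2
    }
}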
I just pushed a fix for this, and it will be included in the next release, 0.11.8.0.
Thanks for catching this one!

Related

How can I script a field in Kibana that matches the first 4 digits of a field?

I'm facing a steep learning curve with the syntax, and my data contains PII, so I can't describe it in much more detail.
I need a new field in Kibana on the already indexed documents. This field "C" would be a combination of the first 4 digits of a field "A" (which contains numbers up to the millions and is of type keyword) and a field "B" (which is of type keyword and holds some large number).
Later I will use this field "C", which is a unique combination, to compare against a list/array of items (I will insert the list into a query DSL in Kibana, as I need to build some visualizations and reports with the returned documents).
I saw that I could use Painless to create this new field, but I don't know exactly whether I need to use regex, or how to.
EDIT:
As requested, more info about the mapping, with a concrete example.
"fieldA" : {
"type: "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"fieldB" : {
"type: "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Example of values:
FieldA = "9876443320134",
FieldB = "000000001".
I would like to sum the first 4 digits of FieldA and the full content of FieldB; FieldC would result in a value of "9877" (9876 + 1 = 9877).
The raw query could look like this:
GET combination_index/_search
{
    "script_fields": {
        "a+b": {
            "script": {
                "source": """
                    def a = doc['fieldA.keyword'].value;
                    def b = doc['fieldB.keyword'].value;
                    if (a == null || b == null) {
                        return null;
                    }
                    def parsed_a = new BigInteger(a);
                    def parsed_b = new BigInteger(b);
                    return new BigInteger(parsed_a.toString().substring(0, 4)) + parsed_b;
                """
            }
        }
    }
}
Note 1: we're parsing the strings into BigInteger because the values may exceed Integer.MAX_VALUE.
Note 2: we're first parsing fieldA and only then calling .toString on it again in order to handle the edge case of fieldA starting with 0s, like 009876443320134. It's assumed that you're looking for 9876, not 98, which would be the result of first calling .substring and then parsing.
If you intend to use it in Kibana visualizations, you'll need an index pattern first. Once you've got one, add a scripted field to the index pattern, put the script in, and click save; the new scripted field becomes available in numeric aggregations and queries.

Elastic query bool must match issue

Below is the query part of an Elasticsearch GET API call, run from the command line inside an OpenShift pod. I get both matching and non-matching elements in the fetch of 2000 documents. How can I limit the results to only the matching elements?
I specifically want to get only documents matching {"kubernetes.container_name": "xyz"}.
Any suggestions will be appreciated.
-d ' {\"query\": { \"bool\" :{\"must\" :{\"match\" :{\"kubernetes.container_name\":\"xyz\"}},\"filter\" : {\"range\": {\"#timestamp\": {\"gte\": \"now-2m\",\"lt\": \"now-1m\"}}}}},\"_source\":[\"#timestamp\",\"message\",\"kubernetes.container_name\"],\"size\":2000}'"
For exact matches there are two things you would need to do:
Make use of Term Queries.
Ensure that the field is of the keyword datatype.
The text datatype goes through an analysis phase.
For example, if your data is This is a beautiful day, then during ingestion the text datatype's analysis would break the sentence into tokens, lowercase them [this, is, a, beautiful, day], and add them to the inverted index. This happens via the Standard Analyzer, which is the default analyzer applied to text fields.
So now when you query, it would again apply the analyzer at query time and check whether the words are present in the respective documents. As a result you see documents appearing even without an exact match.
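You can see this tokenization yourself with the _analyze API (a quick sketch using the example sentence above):

GET _analyze
{
    "analyzer": "standard",
    "text": "This is a beautiful day"
}

This returns the tokens [this, is, a, beautiful, day], i.e. exactly what ends up in the inverted index.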
In order to do an exact match, you would need to make use of keyword fields, as they do not go through the analysis phase.
What I'd suggest is to create a keyword sibling field for the text field you have, in the below manner, and then re-ingest all the data:
Mapping:
PUT my_sample_index
{
    "mappings": {
        "properties": {
            "kubernetes": {
                "type": "object",
                "properties": {
                    "container_name": {
                        "type": "text",
                        "fields": {            <--- Note this
                            "keyword": {       <--- This is the container_name.keyword field
                                "type": "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}
Note that I'm assuming you are making use of object type.
Request Query:
POST my_sample_index/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "kubernetes.container_name.keyword": {
                            "value": "xyz"
                        }
                    }
                }
            ]
        }
    }
}
Hope this helps!

Using Regexp Search inside a must bool query vs using must_not bool query

I want to make queries like:
- get all documents containing / not containing "some value" for a given field
- get all documents having a value equal / not equal to "some value" for a given field
As per my mapping, the fields are of string type, meaning they support both keyword and full-text search, something like:
"myField" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
I was initially using regexp matching like this (this query is for "does not match"):
"bool": {
"must":[
{
"regexp": {
"myField.keyword": {
"value": "~(some value)",
"flags": "ALL"
}
}
}
]
}
So, basically: ~(word) for "not equals", .*word.* for "contains", and ~(.*word.*) for "does not contain".
But then I also came across the must_not bool query. I understand I can add a must_not clause for the "not equals" cases, along with the must and should clauses (for boolean AND and OR between other fields), in my bigger bool query. However, I'm still not sure about the "contains" and "does not contain" searches. Can someone definitively explain what the best practice is here, both in terms of performance and accuracy of the result set returned?
Elasticsearch version used: currently transitioning from v6.3 to v7.1.1.
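For reference, the must_not form of the "not equals" case from the question would look something like this (a sketch; the index name is hypothetical, the field comes from the mapping above):

GET my_index/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "term": {
                        "myField.keyword": "some value"
                    }
                }
            ]
        }
    }
}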

Can we mandate Elasticsearch to treat all numeric fields as double?

I am using dynamic mapping while indexing my data. For example,
{ "a" : 10 }
will create the mapping for the field as long. The second time, the data being indexed may be a double, { "a" : 10.10 }, but since the mapping is already defined as long, it would be indexed as long. The only way to fix this is to define the mapping in advance, which I don't want to do for various reasons.
So my question: is there a way I can mandate Elasticsearch to treat all numeric fields as double?
You can use a dynamic mapping template: https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html
If a value matches as long, map it to double:
PUT my_index
{
    "mappings": {
        "my_type": {
            "dynamic_templates": [
                {
                    "integers": {
                        "match_mapping_type": "long",
                        "mapping": {
                            "type": "double"
                        }
                    }
                }
            ]
        }
    }
}
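With that template in place, an integer value should now be mapped as double. A quick check might look like this (a sketch; my_type matches the mapping type used above, which implies a pre-7.0 cluster):

PUT my_index/my_type/1
{
    "a": 10
}

GET my_index/_mapping

The mapping for a should come back as double rather than long, so a later document like { "a": 10.10 } indexes without loss.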

How to match parts of a word in elasticsearch?

How can I match parts of a word to the parent word? For example, I need to match "eese" or "heese" to the word "cheese".
The best way to achieve this is by using an edgeNGram token filter combined with two reverse token filters. First, you need to define a custom analyzer called reverse_analyzer in your index settings, as below. Then you can see that I've declared a string field called your_field with a sub-field called suffix, which has our custom analyzer applied.
PUT your_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "reverse_analyzer": {
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "reverse", "substring", "reverse"]
                }
            },
            "filter": {
                "substring": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 10
                }
            }
        }
    },
    "mappings": {
        "your_type": {
            "properties": {
                "your_field": {
                    "type": "string",
                    "fields": {
                        "suffix": {
                            "type": "string",
                            "analyzer": "reverse_analyzer"
                        }
                    }
                }
            }
        }
    }
}
Then you can index a test document with "cheese" inside, like this:
PUT your_index/your_type/1
{
    "your_field": "cheese"
}
When this document is indexed, the your_field.suffix field will contain the following tokens:
e
se
ese
eese
heese
cheese
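You can verify this token stream with the _analyze API (a sketch; this request body syntax assumes a reasonably recent cluster):

POST your_index/_analyze
{
    "analyzer": "reverse_analyzer",
    "text": "cheese"
}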
Under the hood, here is what happens when indexing cheese:
The keyword tokenizer emits the whole input as a single token => cheese
The lowercase token filter lowercases the token => cheese
The reverse token filter reverses the token => eseehc
The substring token filter produces edge n-grams of length 1 to 10 => e, es, ese, esee, eseeh, eseehc
Finally, the second reverse token filter reverses all tokens again => e, se, ese, eese, heese, cheese
Those are all the tokens that will be indexed.
So we can finally search for eese (or any suffix of cheese) in that sub-field and find our match
POST your_index/_search
{
    "query": {
        "match": {
            "your_field.suffix": "eese"
        }
    }
}
=> Yields the document we've just indexed above.
You can do it in two ways.
If you need it to happen only for some searches, then in the search box you can pass
*eese* or *heese*
i.e. just add * at the beginning and end of your search word. If you need it for every search, wrap the query string in wildcards, e.g.
"*#{params[:query]}*"
(Ruby string interpolation); this will match against the parent word and return the result.
There are multiple ways to do this:
The analyzer approach: here you use an nGram tokenizer to generate sub-tokens of all the words. Hence for the word "cheese" => ["chee", "hees", "eese", "cheese"] and all kinds of substrings would be generated. With this, the index size will grow, but search speed is optimized.
The wildcard query approach: here a scan happens over the inverted index. This does not occupy additional index size, but it will take more time at search time.
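For reference, the wildcard variant of the earlier suffix search might look like this (a sketch against the same hypothetical your_field):

POST your_index/_search
{
    "query": {
        "wildcard": {
            "your_field": "*eese*"
        }
    }
}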
