Using Regexp Search inside a must bool query vs using must_not bool query - elasticsearch

I want to make queries like:
- get all documents containing / not containing "some value" for a given field
- get all documents with a value equal / not equal to "some value" for a given field.
As per my mapping, the fields are string type, meaning they support both keyword and full-text search, something like:
"myField" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
I was initially using regex matching like this (the query below is for the not-equals case):
"bool": {
"must":[
{
"regexp": {
"myField.keyword": {
"value": "~(some value)",
"flags": "ALL"
}
}
}
]
}
So basically: ~(word) for not equals, .*word.* for contains, and ~(.*word.*) for not contains.
But then I also came across the must_not bool query, so I understand I can add a must_not clause for the not-equals cases alongside the must and should clauses (for boolean AND and OR between other fields) in my bigger bool query; a sketch of what I mean is below. I'm still not sure about the contains and not-contains searches, though. Can someone definitively explain what the best practice is here, in terms of both performance and accuracy of the returned result set?
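For illustration, this is roughly what the must_not variant of the not-equals case would look like; the field and value are the placeholders from above, and this is a sketch of intent, not a confirmed best practice:

"bool": {
  "must_not": [
    { "term": { "myField.keyword": "some value" } }
  ]
}

For the not-contains case, a wildcard clause such as { "wildcard": { "myField.keyword": "*word*" } } inside the same must_not would play the role of ~(.*word.*).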
Elasticsearch version used: currently transitioning from v6.3 to v7.1.1.

Related

Phrase suggester returns unexpected result when first letter is misspelled

I'm using the Elasticsearch phrase suggester to correct users' misspellings. Everything works as I expect unless the user enters a query whose first letter is misspelled. In that situation the phrase suggester returns nothing, or returns unexpected results.
My query for suggestion:
{
  "suggest": {
    "text": "user_query",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "collate": {
          "query": {
            "inline": {
              "bool": {
                "should": [
                  { "match": { "title": "{{suggestion}}" } },
                  { "match": { "participants": "{{suggestion}}" } }
                ]
              }
            }
          }
        }
      }
    }
  }
}
Example when first letter is misspelled:
"simple_phrase" : [
{
"text" : "گاشانچی",
"offset" : 0,
"length" : 11,
"options" : [ {
"text" : "گارانتی",
"score" : 0.00253151
}]
}
]
Example when fifth letter is misspelled:
"simple_phrase" : [
{
"text" : "کاشاوچی",
"offset" : 0,
"length" : 11,
"options" : [ {
"text" : "کاشانچی",
"score" : 0.1121
},
{
"text" : "کاشانجی",
"score" : 0.0021
},
{
"text" : "کاشنچی",
"score" : 0.0020
}]
}
]
I expect these two misspelled queries to have the same suggestions (my expected suggestions are the ones in the second example). What is wrong?
P.S. I'm using this feature for the Persian language.
I have a solution for your problem; you only need to add a field to your schema.
P.S. I don't have that much expertise in Elasticsearch, but I have solved the same problem using Solr, and you can implement it the same way in Elasticsearch too.
Create a new ngram field and copy all your title values into it. When you fire a query for a misspelled word and get an empty result, split the word and fire the same query again; you will then get the expected results.
Example: suppose a user is searching for the word Akshay but types Skshay. Build the query as below and you will hopefully get the expected results. I am giving a Solr example here; a sketch of the same idea in Elasticsearch follows this answer.
(ngram:"skshay" OR ngram:"sk" OR ngram:"ks" OR ngram:"sh" OR ngram:"ha" OR ngram:"ay")
We have split the word sequence-wise and fired the query against the ngram field. Hope it helps.
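For reference, a rough Elasticsearch translation of that Solr setup, assuming ES 7.x; the index name, field names, analyzer name, and gram size are my assumptions, not part of the original answer. Because the ngram analyzer also runs at query time, a plain match query on the ngram field has the same effect as the OR-of-grams query above:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "bigram_analyzer": {
          "tokenizer": "bigram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "bigram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "copy_to": "title_ngram" },
      "title_ngram": { "type": "text", "analyzer": "bigram_analyzer" }
    }
  }
}

POST /my_index/_search
{
  "query": {
    "match": { "title_ngram": "skshay" }
  }
}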
From the Elasticsearch docs:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html
prefix_length
The minimal number of prefix characters that must match for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance. Usually misspellings don't occur at the beginning of terms. (The old name "prefix_len" is deprecated.)
So by default the phrase suggester assumes that the first character is correct, because the default value of prefix_length is 1.
Note: setting this value to 0 is not a good idea, because it has performance implications.
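For what it's worth, a minimal sketch of lowering prefix_length through a direct generator, reusing the title.phrase field from the question; whether the performance cost is acceptable depends on your data:

{
  "suggest": {
    "text": "user_query",
    "simple_phrase": {
      "phrase": {
        "field": "title.phrase",
        "direct_generator": [
          {
            "field": "title.phrase",
            "prefix_length": 0
          }
        ]
      }
    }
  }
}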
You need to use the reverse analyzer. I explained it in this post, so please go and check my answer:
Elasticsearch spell check suggestions even if first letter missed
And regarding duplicates, you can use skip_duplicates:
Whether duplicate suggestions should be filtered out (defaults to false).
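Note that skip_duplicates is an option of the completion suggester, so it only applies if you also maintain a completion-typed field; a sketch, where the title_suggest field name is hypothetical:

{
  "suggest": {
    "my_suggest": {
      "prefix": "user_query",
      "completion": {
        "field": "title_suggest",
        "skip_duplicates": true
      }
    }
  }
}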

ElasticSearch filter on exact url

Let's say I create this document in my index:
PUT /nursery/rhyme/1
{
  "url" : "http://example.com/mary",
  "text" : "Mary had a little lamb"
}
Why does this query not return anything?
POST /nursery/rhyme/_search
{
  "query" : {
    "match_all" : {}
  },
  "filter" : {
    "term" : {
      "url" : "http://example.com/mary"
    }
  }
}
The term query finds documents that contain the exact term specified in the inverted index. When you save the document, the url property is analyzed, which (with the default analyzer) results in the following terms: [http, example, com, mary].
So what you currently have in your inverted index is that bunch of terms, none of which is http://example.com/mary.
What you want is either to not analyze the url property, or to use a match query, which splits the query into terms just like at indexing time:
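A sketch of the match-query option against the question's index; note that match is still term-based, so with the default OR operator other URLs sharing some of these terms would also match, which is why "operator": "and" is used here:

POST /nursery/rhyme/_search
{
  "query" : {
    "match" : {
      "url" : {
        "query" : "http://example.com/mary",
        "operator" : "and"
      }
    }
  }
}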
An exact match does not work on an analyzed field. A string is analyzed by default, which means the string http://example.com/mary will be split and stored in the inverted index as http, example, com, mary. That's why your query returns no results.
You can make your field not analyzed:
{
  "url": {
    "type": "string",
    "index": "not_analyzed"
  }
}
but for this you will have to reindex your index; a reindex sketch follows below.
Read up on not_analyzed and the term query here.
Hope this helps.
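For completeness, in Elasticsearch 2.3+ the reindexing step can be done with the _reindex API; the destination index name here is a placeholder, and it must be created with the corrected mapping first:

POST /_reindex
{
  "source": { "index": "nursery" },
  "dest": { "index": "nursery_v2" }
}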
In Elasticsearch 7.x you have to use the type "keyword" in the mapping properties, which is not analyzed: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
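A sketch of the 7.x equivalent, reusing the index and field from the question; the exact-match filter then becomes a plain term query:

PUT /nursery
{
  "mappings": {
    "properties": {
      "url": { "type": "keyword" }
    }
  }
}

GET /nursery/_search
{
  "query": {
    "term": { "url": "http://example.com/mary" }
  }
}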

Elasticsearch: how to query a long field for exact match

My document has the following mapping property:
"sid" : {"type" : "long", "store": "yes", "index": "no"},
This property has only one value for each record. I would like to query this property. I tried the following queries:
{
  "query" : {
    "term" : {
      "sid" : 10
    }
  }
}

{
  "query" : {
    "match" : {
      "sid" : 10
    }
  }
}
However, I got no results. I do have a document with sid equal to 10. Did I do anything wrong? I would like to query this property for an exact match.
Thanks and regards.
Quote from the documentation:
index: Set to analyzed for the field to be indexed and searchable after being broken down into tokens using an analyzer. not_analyzed means that it is still searchable, but does not go through any analysis process or get broken down into tokens. no means that it won't be searchable at all (as an individual field; it may still be included in _all). Setting to no disables include_in_all. Defaults to analyzed.
So, by setting index to no, you cannot search by that field individually. You either need to remove no from index and choose something else (a sketch of this option follows after the code below), or you can use "include_in_all": "yes" and a different type of query:
"query": {
"match": {
"_all": 10
}
}
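And a sketch of the first option, assuming reindexing is acceptable: drop "index": "no" from the mapping so the field is indexed, after which the original term query works as-is:

"sid" : { "type" : "long", "store" : "yes" }

{
  "query" : {
    "term" : {
      "sid" : 10
    }
  }
}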

Elasticsearch doesn't return results

I am facing a strange issue with an Elasticsearch query. I don't know much about Elasticsearch. My query is:
{
  "query": {
    "bool": {
      "must": [
        {
          "text": {
            "countryCode2": "DE"
          }
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 1,
  "sort": [],
  "facets": {}
}
The issue is with the values: "DE" gives me results, but "BE" or "IN" returns an empty result.
You are indexing using the default mapping, which by default removes English stopwords. The country codes "IN", "BE", and many more are stopwords that don't even get indexed, so it's not possible to have matching documents, nor to get those country codes back when faceting on that field.
The solution is to reindex after submitting your own mapping for the country code field:
{
  "your_type_name" : {
    "properties" : {
      "country" : {
        "type" : "string",
        "index" : "not_analyzed"
      }
    }
  }
}
If you already tried to do this but nothing changed, the mapping didn't get submitted properly. I would suggest double-checking that its JSON structure is correct and that you can actually get it back using the get mapping API.
As this is a common problem, the defaults will probably change in the future to be less intrusive and avoid applying any language-dependent text analysis.
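A quick way to run that check; the index and type names here are placeholders:

GET /your_index/_mapping/your_type_name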

Exact (not substring) matching in Elasticsearch

{"query":{
"match" : {
"content" : "2"
}
}} matches all the documents whole content contains the number 2, however I would like the content to be exactly 2, no more no less - think of my requirement in a spirit of Java's String.equals.
Similarly for the second query I would like to match when the document's content is exactly '3 3' and nothing more or less. {"query":{
"match" : {
"content" : "3 3"
}
}}
How could I do exact (String.equals) matching in Elasticsearch?
Without seeing your index type mapping and sample data, it's hard to answer this directly, but I'll try.
Offhand, I'd say this is similar to this answer (https://stackoverflow.com/a/12867852/382774), where you simply set the content field's index option to not_analyzed in your mapping:
"url" : {
"type" : "string",
"index" : "not_analyzed"
}
Edit: I wasn't clear enough in my original answer, shown above. I did not mean to imply that you should add the example code to your query; I meant that you need to specify in your index type mapping that the url field is of type string and that it is indexed but not analyzed (not_analyzed).
This tells Elasticsearch not to bother analyzing (tokenizing or token filtering) the field when you index your documents, and to just store it in the index as it exists in the document. For more information on mappings, see http://www.elasticsearch.org/guide/reference/mapping/ for an intro and http://www.elasticsearch.org/guide/reference/mapping/core-types/ for specifics on not_analyzed (tip: search for it on that page).
Update:
The official docs tell us that in newer versions of Elasticsearch you can't define a field as "not_analyzed"; instead you should use the "keyword" type.
For old versions of Elasticsearch:
{
  "foo": {
    "type": "string",
    "index": "not_analyzed"
  }
}
For new versions:
{
  "foo": {
    "type": "keyword",
    "index": true
  }
}
Note that this functionality (the keyword type) exists from Elasticsearch 5.0 onward, and the backward compatibility layer was removed in the Elasticsearch 6.0 release.
Official Doc
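With that keyword mapping in place, an exact match is then a plain term query; a sketch reusing the foo field from the update above:

{
  "query": {
    "term": {
      "foo": "2"
    }
  }
}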
You should use a filter instead of match:
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "term" : {
          "content" : 2
        }
      }
    }
  }
}
That way you get docs whose content is exactly 2, rather than 20 or 2.1.
