Subword search in Elasticsearch using ngram does not work

I would like to perform a simple_query_string search in Elasticsearch with sub-word matching.
For example, if I had the filename "C:\Users\Sven Onderbeke\Documents\Arduino",
then I would want this filename listed if my search term were, for example, "ocumen".
This thread suggested using ngram to match on parts of a word. I tried to implement it as follows (in Python), but I get zero results where I expect one:
test_mapping = {
    "properties": {
        "filename": {
            "type": "text",
            "analyzer": "my_index_analyzer"
        },
    }
}
def create_index(index_name, mapping):
    created = False
    # index settings
    settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
        },
        "analysis": {
            "index_analyzer": {
                "my_index_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "mynGram"
                    ]
                }
            },
            "search_analyzer": {
                "my_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "standard",
                        "lowercase",
                        "mynGram"
                    ]
                }
            },
            "filter": {
                "mynGram": {
                    "type": "nGram",
                    "min_gram": 2,
                    "max_gram": 50
                }
            }
        },
        "mappings": mapping
    }
    try:
        if not es.indices.exists(index_name):
            # Ignore 400 means to ignore "Index Already Exist" error.
            es.indices.create(index=index_name, ignore=400, body=settings)
            print(f'Created Index: {index_name}')
            created = True
    except Exception as ex:
        print(str(ex))
    finally:
        return created

create_index("test", test_mapping)
doc = {
    'filename': r"C:\Users\Sven Onderbeke\Documents\Arduino",
}
es.index(index="test", document=doc)

needle = "ocumen"
q = {
    "simple_query_string": {
        "query": needle,
        "default_operator": "and"
    }
}
res = es.search(index="test", query=q)
print(res)
for hit in res['hits']['hits']:
    print(hit)

The reason your solution isn't working is that you haven't set an analyzer on the filename property when defining the mapping. Update the mapping as below and then reindex all documents.
test_mapping = {
    "properties": {
        "filename": {
            "type": "text",
            "analyzer": "my_index_analyzer"
        },
    }
}
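
To verify that the n-gram filter actually produces sub-word tokens such as "ocumen", you can inspect the analyzer output with the _analyze API. A minimal sketch (shown as a Kibana Dev Tools request, not part of the original answer), assuming the index is called test and my_index_analyzer was registered correctly in the index settings:
GET test/_analyze
{
  "analyzer": "my_index_analyzer",
  "text": "C:\\Users\\Sven Onderbeke\\Documents\\Arduino"
}
If "ocumen" does not appear among the returned tokens, the analyzer was never applied (or never defined) the way you intended, and the simple_query_string search cannot match it.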

Related

how to require minimum length letters from query to match in elasticsearch

I want to require that the query contain at least 5 consecutive matching characters in order to match a particular field. The match can be somewhat fuzzy (ideally, the longer the matching sequence is, the fuzzier it can be).
In this example I defined an n-gram filter with a minimum gram length of 5 characters. That way it is only possible to match with at least 5 characters.
PUT teste
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "shingle_analyzer"
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "shingle_filter"
          ]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "edge_ngram",
          "min_gram": 5,
          "max_gram": 8
        }
      }
    }
  }
}
POST teste/_doc
{
  "name": "example text match fiver terms sequence"
}

GET teste/_search
{
  "query": {
    "match": {
      "name.ngram": "exampl"
    }
  }
}
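
If you also want the fuzziness the question asks about, the match query accepts a fuzziness parameter on top of the n-gram sub-field. A hedged sketch (not part of the original answer); "AUTO" scales the allowed edit distance with the length of each analyzed term:
GET teste/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "exampl",
        "fuzziness": "AUTO"
      }
    }
  }
}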

Elastic search partial substring search

I am trying to implement partial substring search in Elasticsearch 7.1 using the following analyzer:
PUT my_index-001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "autocomplete"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "autocomplete": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 40
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
After that I tried adding some sample data to my_index-001 with type doc:
PUT my_index-001/doc/1
{
  "title": "ABBOT Series LTD 2014"
}

PUT my_index-001/doc/2
{
  "title": "ABBOT PLO LTD 2014A"
}

PUT my_index-001/doc/3
{
  "title": "ABBOT TXT"
}

PUT my_index-001/doc/4
{
  "title": "ABBOT DMO LTD. 2016-II"
}
Query used to perform the partial search:
GET my_index-001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "ABB",
        "operator": "or"
      }
    }
  }
}
I was expecting the following output from the analyzer:
If I type in ABB I should get doc IDs 1, 2, 3, 4
If I type in ABB 2014 I should get doc IDs 1, 2
If I type in ABBO PLO I should get doc 2
If I type in TXT I should get doc 3
With the above analyzer settings I am not getting the expected results.
Please let me know if I am missing anything in my Elasticsearch analyzer settings.
You were almost there but there are a couple of issues.
When creating index mappings through Kibana Dev Tools, there mustn't be any whitespace between the URI and the request body. You have whitespace in the first code snippet which caused ES to ignore the request body entirely! So remove that whitespace.
The maximum ngram difference is set to 1 by default. In order to use your high ngram intervals, you'll need to explicitly increase the index-level setting max_ngram_diff:
PUT my_index-001
{
  "settings": {
    "index": {
      "max_ngram_diff": 40     <--
    },
    ...
  }
}
Type names are deprecated in v7. So is the nGram token filter in favor of ngram (lowercase g). And so is the string field type too! Here's the corrected PUT request body:
PUT my_index-001      <--- no whitespace after the URI!
{
  "settings": {
    "index": {
      "max_ngram_diff": 40     <--- explicit setting
    },
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "autocomplete"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "autocomplete": {
          "type": "ngram",     <--- ngram, not nGram
          "min_gram": 2,
          "max_gram": 40
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",        <--- text, not string
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}
Since custom mapping types have been deprecated in favor of the generic _doc type, you'll need to adjust the way you insert documents. The only difference, luckily, is changing doc to _doc in the URI:
PUT my_index-001/_doc/1
{ "title": "ABBOT Series LTD 2014" }
PUT my_index-001/_doc/2
{ "title": "ABBOT PLO LTD 2014A" }
PUT my_index-001/_doc/3
{ "title": "ABBOT TXT" }
PUT my_index-001/_doc/4
{ "title": "ABBOT DMO LTD. 2016-II" }
Finally, your query is perfectly fine and should behave the way you expect it to. The only thing to change is the operator to and when querying for two or more substrings, i.e.:
GET my_index-001/_search
{
  "query": {
    "match": {
      "title": {
        "query": "ABB 2014",
        "operator": "and"
      }
    }
  }
}
Other than that, all four of your test scenarios should return what you expect.
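
To see why a 3-character query like ABB matches, it can help to inspect what the index-time autocomplete analyzer emits. A small sketch using the _analyze API on the corrected index (this request is illustrative and not part of the original answer):
GET my_index-001/_analyze
{
  "analyzer": "autocomplete",
  "text": "ABBOT TXT"
}
The response should list 2- to 40-character grams such as ab, abb, abbo, abbot, tx, and txt; the search-time analyzer only lowercases the input, so the query term abb matches the indexed gram abb in all four documents.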

Elasticsearch index analyzers seem to do nothing after being added

New to ES and following the docs (https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html) on using different analyzers to deal with human language. After following some of the examples, it appears as though the added analyzers are having no effect on searches at all. E.g.:
## init some index for testing
PUT /testindex
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 3,
    "analysis": {},
    "refresh_interval": "1s"
  },
  "mappings": {
    "testtype": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}
## adding some analyzers for...
POST /testindex/_close

## ... simple lowercase tokenization (https://www.elastic.co/guide/en/elasticsearch/guide/current/lowercase-token-filter.html#lowercase-token-filter), ...
PUT /testindex/_settings
{
  "analysis": {
    "analyzer": {
      "my_lowercaser": {
        "tokenizer": "standard",
        "filter": [ "lowercase" ]
      }
    }
  }
}
## ... normalization (https://www.elastic.co/guide/en/elasticsearch/guide/current/algorithmic-stemmers.html#_using_an_algorithmic_stemmer), ...
PUT testindex/_settings
{
  "analysis": {
    "filter": {
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "light_english_stemmer": {
        "type": "stemmer",
        "language": "light_english"
      },
      "english_possessive_stemmer": {
        "type": "stemmer",
        "language": "possessive_english"
      }
    },
    "analyzer": {
      "english": {
        "tokenizer": "standard",
        "filter": [
          "english_possessive_stemmer",
          "lowercase",
          "english_stop",
          "light_english_stemmer",
          "asciifolding"
        ]
      }
    }
  }
}
## ... and using a hunspell dictionary (https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html#hunspell)
PUT testindex/_settings
{
  "analysis": {
    "filter": {
      "en_US": {
        "type": "hunspell",
        "language": "en_US"
      }
    },
    "analyzer": {
      "en_US": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "en_US"
        ]
      }
    }
  }
}

POST /testindex/_open

GET testindex/_settings
## it appears as though the analyzers have been added without problem
## adding some testing data
POST /testindex/testtype
{
  "title": "Will the root word of movement be found?"
}

POST /testindex/testtype
{
  "title": "That's why I never want to hear you say, ehhh I waant it thaaat away."
}

## expecting to match against root word of movement (move)
GET /testindex/testtype/_search
{
  "query": {
    "match": {
      "title": "moving"
    }
  }
}
## which returns 0 hits, as shown below
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

## ... yet I can see that the record expected does in fact exist in the index when using...
GET /testindex/testtype/_search
{
  "query": {
    "match_all": {}
  }
}
Thinking then that I need to actually "add" the analyzer to a (new) field, I do the following (which still shows negative results)
# adding the analyzers to a new field
POST /testindex/testtype
{
  "mappings": {
    "properties": {
      "title2": {
        "type": "text",
        "analyzer": [
          "my_lowercaser",
          "english",
          "en_US"
        ]
      }
    }
  }
}
# looking at the tokens I'd expect to be able to find
GET /testindex/_analyze
{
  "analyzer": "en_US",
  "text": "Moving between directories"
}
# moving, move, between, directory

# what I actually see
GET /testindex/_analyze
{
  "field": "title2",
  "text": "Moving between directories"
}
# moving, between, directories
Even trying something simpler like
POST /testindex/testtype
{
  "mappings": {
    "properties": {
      "title2": {
        "type": "text",
        "analyzer": "en_US"
      }
    }
  }
}
does not help at all.
So this seems very messed up. Am I missing something here about how these analyzers are supposed to work? Should these analyzers be working properly (based on the provided info), and am I simply misusing them here? If so, could someone please provide an example query that would actually work/hit?
Is there other debugging information that should be added here?
The title2 field has 3 analyzers, but according to your output (from the _analyze endpoint) it seems that only my_lowercaser is applied.
Finally, the config that worked for me with hunspell is:
"settings": {
"analysis": {
"filter": {
"en_US": {
"type": "hunspell",
"language": "en_US"
}
},
"analyzer": {
"en_US": {
"tokenizer": "standard",
"filter": [ "lowercase", "en_US" ]
}
}
}
}
"mappings": {
"_doc": {
"properties": {
"title-en-us": {
"type": "text",
"analyzer": "en_US"
}
}
}
}
Note that movement is not resolved to move while moving is (probably hunspell dictionary related). Querying with move returned docs containing moving only, but not movement.
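
As an illustration of that last point, a query against the working config might look like the sketch below (index and field names follow the mapping above; this request is not part of the original answer):
GET testindex/_search
{
  "query": {
    "match": {
      "title-en-us": "move"
    }
  }
}
With the hunspell analyzer applied at both index and search time, this finds documents containing "moving" but, as noted, not those containing "movement".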

ElasticSearch Reverse Wildcard Search

In ElasticSearch v5.2.2 I can search for "Jo*" using a wildcard query and it will match the indexed value "Joseph".
But what if my index also has the values "Joseph", "Jo", "Jos", "Jose" and "Josep", and I want to reverse the query?
How can I find "Jo", "Jos", "Jose" and "Josep" in the index using the string "Joseph" as search criteria?
That's possible, but you need to create an edgeNGram search analyzer in your index settings.
First create the settings like this. The name field will be indexed with the standard analyzer but searched with your custom prefix_search analyzer instead.
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "prefix_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "prefix"
          ]
        }
      },
      "filter": {
        "prefix": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "standard",
          "search_analyzer": "prefix_search"
        }
      }
    }
  }
}
Then if you create a document like this:
PUT test/doc/1
{
  "name": "Jos"
}
You can find it with a query like this one:
POST /test/doc/_search
{
  "query": {
    "match": {
      "name": "Joseph"
    }
  }
}
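
To see why this works in reverse, you can check what the prefix_search analyzer does to the query string. A hedged sketch (not part of the original answer):
GET test/_analyze
{
  "analyzer": "prefix_search",
  "text": "Joseph"
}
At search time "Joseph" is expanded into the edge n-grams j, jo, jos, jose, josep, joseph, so the indexed term jos (produced by the standard analyzer at index time) matches one of them; the same applies to "Jo", "Jose" and "Josep".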

ngrams in elasticsearch are not working

I use the Elasticsearch ngram filter:
"analysis": {
"filter": {
"desc_ngram": {
"type": "ngram",
"min_gram": 3,
"max_gram": 8
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "desc_ngram", "lowercase" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
And I have 2 objects here
{
  "name": "Shana Calandra",
  "username": "shacalandra",
},
{
  "name": "Shana Launer",
  "username": "shalauner",
},
And using this query
{
  query: {
    match: {
      _all: "Shana"
    }
  }
}
When I search with this query, it returns both documents, but I can't search by part of a word here; for example, I can't use "Shan" instead of "Shana" in the query because it doesn't return anything.
Maybe my mapping is wrong; I can't tell whether the problem is in the mapping or in the query.
If you specify
"mappings": {
"test": {
"_all": {
"index_analyzer": "index_ngram",
"search_analyzer": "search_ngram"
},
for your mapping of the _all field, then it will work. _all has its own analyzers, and I suspect you used the analyzers just for name and username and not for _all.
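
Putting the question's analysis settings together with this _all mapping, the full index creation might look like the sketch below (the index name myindex is illustrative, and this uses the old index_analyzer syntax from the question's Elasticsearch 1.x era, which was removed in later versions):
PUT myindex
{
  "settings": {
    "analysis": {
      "filter": {
        "desc_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 8
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "desc_ngram", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "_all": {
        "index_analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "properties": {
        "name": { "type": "string" },
        "username": { "type": "string" }
      }
    }
  }
}
With the _all field n-grammed this way, a match query on _all for "Shan" should hit both documents.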
