ElasticSearch Reverse Wildcard Search - elasticsearch

In ElasticSearch v5.2.2 I can search for "Jo*" using Wildcard and it will match the index value containing "Joseph"
But what if my index also has these values "Joseph","Jo", "Jos", "Jose" and "Josep" and I want to reverse the query.
How can I find "Jo", "Jos", "Jose" and "Josep" in the index using the string "Joseph" as search criteria?

That's possible, but you need to create an edgeNGram search analyzer in your index settings.
First create the settings like this. The name field will be indexed with the standard analyzer but searched with your custom prefix_search analyzer instead.
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"prefix_search": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"prefix"
]
}
},
"filter": {
"prefix": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "string",
"analyzer": "standard",
"search_analyzer": "prefix_search"
}
}
}
}
}
Then if you create a document like this:
PUT test/doc/1
{
"name": "Jos"
}
You can find it with a query like this one:
POST /test/doc/_search
{
"query": {
"match": {
"name": "Joseph"
}
}
}

Related

Elastic search partial substring search

I am trying to implement partial substring search in elastic serach 7.1 using following analyzer
PUT my_index-001
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "nGram",
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
After that i tried adding some sample data to my_index-001 and type doc
PUT my_index-001/doc/1
{
"title": "ABBOT Series LTD 2014"
}
PUT my_index-001/doc/2
{
"title": "ABBOT PLO LTD 2014A"
}
PUT my_index-001/doc/3
{
"title": "ABBOT TXT"
}
PUT my_index-001/doc/4
{
"title": "ABBOT DMO LTD. 2016-II"
}
Query used to perform partial search :
GET my_index-001/_search
{
"query": {
"match": {
"title": {
"query": "ABB",
"operator": "or"
}
}
}
}
I was expecting the following output from the analyzer
If i type in ABB i should get docid 1,2,3,4
If i type in ABB 2014 i should get docid 1,2
IF i type in ABBO PLO i should get doc 2
If i type in TXT i should get doc 3
With the above analyzer setting i am not getting expected results .
Please let me know if i am missing anything in my analyzer setting of Elastic search
You were almost there but there are a couple of issues.
When creating index mappings through Kibana Dev Tools, there mustn't be any whitespace between the URI and the request body. You have whitespace in the first code snippet which caused ES to ignore the request body entirely! So remove that whitespace.
The maximum ngram difference is set to 1 by default. In order to use your high ngram intervals, you'll need to explicitly increase the index-level setting max_ngram_diff:
PUT my_index-001
{
"settings": {
"index": {
"max_ngram_diff": 40 <--
},
...
}
}
Type names are deprecated in v7. So is the nGram token filter in favor of ngram (lowercase g). And so is the string field type too! Here's the corrected PUT request body:
PUT my_index-001 <--- no whitespace after the URI!
{
"settings": {
"index": {
"max_ngram_diff": 40 <--- explicit setting
},
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete"
]
},
"autocomplete_search": {
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
},
"filter": {
"autocomplete": {
"type": "ngram", <--- ngram, not nGram
"min_gram": 2,
"max_gram": 40
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text", <--- text, not string
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
Since different mapping types had been deprecated in favor of the generic _doc type, you'll need to adjust the way you insert documents. The only difference, luckily, is changing doc to _doc in the URI:
PUT my_index-001/_doc/1
{ "title": "ABBOT Series LTD 2014" }
PUT my_index-001/_doc/2
{ "title": "ABBOT PLO LTD 2014A" }
PUT my_index-001/_doc/3
{ "title": "ABBOT TXT" }
PUT my_index-001/_doc/4
{ "title": "ABBOT DMO LTD. 2016-II" }
Finally, your query is perfectly fine and should behave the way you expect it to. The only thing to change is the operator to and when querying for two or more substrings, i.e.:
GET my_index-001/_search
{
"query": {
"match": {
"title": {
"query": "ABB 2014",
"operator": "and"
}
}
}
}
Other than that, all four of your test scenarios should return what you expect.

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field to Name using a ngram tokenizer ... so I assume cannot be relevant to my problem but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search terms, so the text you pass into it bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer) and evaluates Harry as-is directly against the keyword-tokenized, custom-filtered harry potter that was the result of the keywordAnalyzer applied at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query like described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run the a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"query": "Har",
"operator": "and"
}
}
}
I think it is better to use match_phrase_prefix query without using .keyword suffix. Check the docs at here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html

Get exact match after doing mapping as not_analyzed

I have elasticsearch type I mapped as below,
mappings": {
"jardata": {
"properties": {
"groupID": {
"index": "not_analyzed",
"type": "string"
},
"artifactID": {
"index": "not_analyzed",
"type": "string"
},
"directory": {
"type": "string"
},
"jarFileName": {
"index": "not_analyzed",
"type": "string"
},
"version": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
I am using index of directory as analyzed since I want give only the last folder and get the results, But when I want to search a specific directory I need to give the whole path since there can be same folder in two paths. The problem here is since it is analyzed it will all data instead the specific one I want.
The problem here is I want to act it like both analyzed and not_analyzed. is there a way for that?
Let's say you have the following document indexed:
{
"directory": "/home/docs/public"
}
The standard analyzer is not enough in your case as it will create following terms while indexing:
[home, docs, public]
Note that it misses [/home/docs/public] token - characters like "/" etc. are acting as separators here.
One solution could be to use NGram tokenizer with punctuation character class in token_chars list. Elasticsearch would treat "/" as it would be a letter or digit. This would allow to search with following tokens:
[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
Index mapping:
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 18,
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
Now both search queries:
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/docs/private"
}
}
}
}
}
and
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/home/docs/private"
}
}
}
}
}
will give the indexed document in result.
One thing you have to consider is the maximum length of the token that is specified in "max_gram" setting. In case of directory paths it could be necessary to have it longer.
Alternative solution is to use Whitespace tokenizer, that breaks the phrase into terms only on whitespaces, and NGram filter with following mapping:
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 4,
"max_gram": 20
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}
update the mapping of the directory field to contain raw field like this:
"directory": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
}
}
And modify your query to include directory.raw which will treat it like not_analyzed. Refer this.

wildcard on different tokens in elastic search

I have a document which looks like this
Name
Thomy tyson Olando Magua
Using ngram i was able to acheive the wildcard search so that if i type in omy tyson it can return me the above document pretty much similar to this sql query
select name from table where name like '%omy tyson%'
PUT sample
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "15"
}
}
}
},
"mappings": {
"typename": {
"properties": {
"name": {
"type": "string",
"fields": {
"search": {
"type": "string",
"analyzer": "my_ngram_analyzer"
}
}
}
}
}
}
}
PUT sample/typename/2
{
"name": "Thomy tyson Olando Magua"
}
{
"query": {
"bool": {
"should": [
{
"term": {
"name.search": "omy tyson"
}
}
]
}
}
}
Is there a way in elastic search where i can perform wildcard search on 2 different words separated by other words like
select name from table where name like '%omy Magua%'
So in this case i would like to perform partial search on first and fourth word.
Any feedback would be helpfull

ngrams ins elasticsearch are not working

I use elasticsearch ngram
"analysis": {
"filter": {
"desc_ngram": {
"type": "ngram",
"min_gram": 3,
"max_gram": 8
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "desc_ngram", "lowercase" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
And I have 2 objects here
{
"name": "Shana Calandra",
"username": "shacalandra",
},
{
"name": "Shana Launer",
"username": "shalauner",
},
And using this query
{
query: {
match: {
_all: "Shana"
}
}
}
When I search with this query, it returns me both documents, but I cant search by part of word here, for example I cant use "Shan" instead of "Shana" in query because it doesnt return anything.
Maybe my mapping is wrong, I cant understand problem is on mapping or on query
If you specify
"mappings": {
"test": {
"_all": {
"index_analyzer": "index_ngram",
"search_analyzer": "search_ngram"
},
for your mapping of _all field then it will work. _all has its own analyzers and I suspect you used the analyzers just for name and username and not for _all.

Resources