Elasticsearch - How to search for multiple words in one string

I'm having issues getting the Elasticsearch results I need.
My mappings look like this:
"mappings": {
"product": {
"_meta": {
"model": "App\\Entity\\Product"
},
"dynamic_date_formats": [],
"properties": {
"articleNameSearch": {
"type": "text",
"analyzer": "my_analyzer"
},
"articleNumberSearch": {
"type": "text",
"fielddata": true
},
"brand": {
"type": "nested",
"properties": {
"name": {
"type": "text"
}
}
}
}
}
},
My settings:
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "my_index",
"creation_date": "1572252785482",
"analysis": {
"filter": {
"standard": {
"type": "standard"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"standard"
],
"type": "custom",
"tokenizer": "lowercase"
}
}
},
"number_of_replicas": "1",
"uuid": "bwmc7NZ9RXqB1lpQ3e8HTQ",
"version": {
"created": "5060399"
}
}
}
The data inside:
"hits": [
{
"_index": "my_index",
"_type": "product",
"_id": "14",
"_score": 1.0,
"_source": {
"articleNumberSearch": "5003xx843",
"articleNameSearch": "this is a test string",
"brand": {
"name": "Brand name"
}
}
},
Currently the PHP code for the query looks like this (this does not return correct records):
$searchQuery = new BoolQuery();
$formattedQuery = "*" . str_replace(['.', '|'], '', trim(mb_strtolower($query))) . "*";
/**
* Test NGRAM analyzer
*/
$matchQuery = new Query\MultiMatch();
$matchQuery->setFields([
'articleNumberSearch',
'articleNameSearch',
]);
$matchQuery->setQuery($formattedQuery);
$searchQuery->addMust($matchQuery);
/**
* Nested query
*/
$nestedQuery = new Nested();
$nestedQuery->setPath('brand');
$nestedQuery->setQuery(
new Match('brand.name', 'Brand name')
);
$searchQuery->addMust($nestedQuery);
I'm creating an auto-complete search field where you can search articleNumberSearch and articleNameSearch, while the brand name is always a fixed value.
For example, I want to be able to search for:
500
which should find this hit, because 500 is part of the articleNumberSearch value.
But I also want to be able to search for:
this is string
A couple of questions:
Which query do I need to use?
Am I using the right analyzer?
Is my analyzer correctly configured?

You should create an ngram tokenizer.
The ngram tokenizer first breaks text down into words whenever it encounters a character that is not in the configured token_chars list, then emits n-grams of each word with lengths between min_gram and max_gram.
Something like this:
"analysis": {
"analyzer": {
"autocomplete": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
],
"min_gram": "1",
"type": "ngram",
"max_gram": "2"
}
}
}
NGram Tokenizer
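To make the effect of this tokenizer more concrete, here is a minimal Python sketch that imitates it outside Elasticsearch. The function name `ngram_tokenize` is hypothetical, and splitting on whitespace is a simplifying assumption standing in for the `letter`, `digit`, `symbol` and `punctuation` character classes; the real tokenizer also tracks positions and offsets, which are ignored here.

```python
def ngram_tokenize(text, min_gram=1, max_gram=2):
    """Imitate an ngram tokenizer: split on whitespace, then emit every
    substring of each word whose length is between min_gram and max_gram."""
    grams = []
    # Lowercasing up front stands in for the "lowercase" token filter.
    for word in text.lower().split():
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    grams.append(word[start:start + size])
    return grams


indexed = set(ngram_tokenize("5003xx843"))
query = set(ngram_tokenize("500"))

# Every gram produced for the query also exists in the indexed value,
# which is why a match query for "500" can find "5003xx843".
print(sorted(query))     # ['0', '00', '5', '50']
print(query <= indexed)  # True
```

With a `max_gram` of 2, a longer query is only ever compared through its one- and two-character pieces, so unrelated values that share those pieces may match as well; raising `max_gram` is one adjustment worth trying.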

Related

Why isn't the shingle token filter with an analyzer yielding the expected results?

Here are my index details:
PUT shingle_test
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "10",
"output_unigrams": false
}
}
}
},
"mappings": {
"legacy" : {
"properties": {
"name": {
"type": "text",
"fields": {
"shingles": {
"type": "text",
"analyzer": "standard",
"search_analyzer": "evolutionAnalyzer"
},
"as_is": {
"type": "keyword"
}
},
"analyzer": "standard"
}
}
}
}
}
I added two documents:
PUT shingle_test/legacy/1
{
"name": "Chandni Chowk 2 Banglore"
}
PUT shingle_test/legacy/2
{
"name": "Chandni Chowk"
}
Nothing is returned when I run this query:
GET shingle_test/_search
{
"query": {
"match": {
"name": {
"query": "Chandni Chowk",
"analyzer": "evolutionAnalyzer"
}
}
}
}
I looked at every solution I could find online, but none of them worked.
Also, if I set "output_unigrams": true, it just behaves like a regular match query and returns results.
This is what I'm trying to achieve.
Given these documents:
Chandni Chowk 2 Bangalore
Chandni Chowk
CCD Bangalore
Istah shawarma and biryani
Istah
So,
searching for "Chandni Chowk 2 Bangalore" should return 1, 2
searching for "Chandni Chowk" should return 1, 2
searching for "Istah shawarma and biryani" should return 4, 5
searching for "Istah" should return 4, 5
searching for "CCD Bangalore" should return 3
Note: the search keyword will always be exactly equal to the value of the name field in one of the documents. For example, in this particular index we can query "Chandni Chowk 2 Bangalore", "Chandni Chowk", "CCD Bangalore", "Istah shawarma and biryani" and "Istah". "CCD" on its own won't be queried on this index.
The analyzer parameter specifies the analyzer used for text analysis when indexing or searching a text field.
Modify your index mapping as follows:
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "10",
"output_unigrams": true // note this
}
}
}
},
"mappings": {
"legacy" : {
"properties": {
"name": {
"type": "text",
"fields": {
"shingles": {
"type": "text",
"analyzer": "evolutionAnalyzer", // note this
"search_analyzer": "evolutionAnalyzer"
},
"as_is": {
"type": "keyword"
}
},
"analyzer": "standard"
}
}
}
}
}
The modified search query will be:
{
"query": {
"match": {
"name.shingles": {
"query": "Chandni Chowk"
}
}
}
}
Search Results:
"hits": [
{
"_index": "66127416",
"_type": "_doc",
"_id": "2",
"_score": 0.25759193,
"_source": {
"name": "Chandni Chowk"
}
},
{
"_index": "66127416",
"_type": "_doc",
"_id": "1",
"_score": 0.19363807,
"_source": {
"name": "Chandni Chowk 2 Banglore"
}
}
]
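The difference between the two setups can be sketched in plain Python. This is only an approximation of the shingle token filter, under the assumption that tokens are produced by splitting on whitespace; the helper name `shingles` is hypothetical and not part of any Elasticsearch client.

```python
def shingles(tokens, min_size=2, max_size=10, output_unigrams=False):
    """Imitate a shingle token filter: join runs of adjacent tokens."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


doc_tokens = "Chandni Chowk 2 Banglore".split()
query_tokens = "Chandni Chowk".split()

# Original setup: the field holds plain lowercase words (standard analyzer),
# while the query is turned into shingles only.
indexed_plain = {token.lower() for token in doc_tokens}
query_shingles_only = set(shingles(query_tokens, output_unigrams=False))
print(indexed_plain & query_shingles_only)  # set(): no term in common

# Modified setup: the same shingle analyzer at index time and search time.
indexed_shingled = set(shingles(doc_tokens, output_unigrams=True))
query_shingled = set(shingles(query_tokens, output_unigrams=True))
print(sorted(indexed_shingled & query_shingled))
# ['Chandni', 'Chandni Chowk', 'Chowk']
```

In the first case the query produces the single term "Chandni Chowk", which was never written to the index, so nothing can match. Once both sides use the same analyzer, the terms overlap.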

How to match when search term has more words than index?

I have an indexed field whose values are 2-4 characters long with no spaces, but users often search for the "full term", which I don't have indexed and which has 3 extra characters after a blank space.
Ex: I index "A1", "A1B" or "A1B2", and the "full term" is something like
"A1 11A", "A1B ABA" or "A1B2 2C8".
This is current mapping:
"code": {
"type": "text"
},
If the user searches for "A1", it returns all of them, which is correct. If they type "A1B", I want to return only the last two, and if they search for "A1B2 2C8", I want to return only the last one.
Is that possible? If so, what would be the best search/index strategy?
Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"code": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index data:
{
"code": "A1"
}
{
"code": "A1B"
}
{
"code": "A1B2"
}
Search Query:
{
"query": {
"match": {
"code": {
"query": "A1B2 2C8"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65067196",
"_type": "_doc",
"_id": "3",
"_score": 1.3486402,
"_source": {
"code": "A1B2"
}
}
]
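Why this mapping returns only "A1B2" can be traced with a small Python sketch. It is a simplified model, not Elasticsearch itself: the names `edge_ngrams` and `search` are hypothetical, and the standard tokenizer is approximated by splitting on whitespace.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Imitate an edge_ngram filter: prefixes of a token, shortest first."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


# Index time: lowercase each code, then expand it into its prefixes.
codes = ["A1", "A1B", "A1B2"]
index = {code: set(edge_ngrams(code.lower())) for code in codes}


def search(text):
    """Search time: split on whitespace and lowercase only (no n-grams).
    A document matches if it contains any of the query terms."""
    terms = text.lower().split()
    return [code for code in codes if any(term in index[code] for term in terms)]


print(search("A1"))        # ['A1', 'A1B', 'A1B2']
print(search("A1B"))       # ['A1B', 'A1B2']
print(search("A1B2 2C8"))  # ['A1B2']
```

The key design choice is the asymmetry: prefixes are generated only at index time, while the query is left whole, so a query term matches a document only when it is a prefix of the stored code.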

Elasticsearch Only Searching From Start

Currently, Elasticsearch is only searching through the mapped items from the beginning of the string instead of throughout the string.
I have a custom analyzer, as well as a custom edge ngram tokenizer.
I am currently using bool queries from within JavaScript to search the index.
Index
{
"homestead_dev_index": {
"aliases": {},
"mappings": {
"elasticprojectnode": {
"properties": {
"archived": {
"type": "boolean"
},
"id": {
"type": "text",
"analyzer": "full_name"
},
"name": {
"type": "text",
"analyzer": "full_name"
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "homestead_dev_index",
"creation_date": "1535439085947",
"analysis": {
"analyzer": {
"full_name": {
"filter": [
"standard",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "mytok"
}
},
"tokenizer": {
"mytok": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "10"
}
}
},
"number_of_replicas": "1",
"uuid": "iCa7qKJVRU-_MA8sCYIAXw",
"version": {
"created": "5060399"
}
}
}
}
}
Query Body
{
"query": {
"bool": {
"should": [
{ "match": { "name": this.searchString } },
{ "match": { "id": this.searchString } }
]
}
},
"highlight": {
"pre_tags": ["<b style='background-color:yellow'>"],
"post_tags": ["</b>"],
"fields": {
"name": {},
"id": {}
}
}
}
Example
If I have projects with the names "Road - Area 1", "Road - Area 2" and "Sub-area 5 - Road" and the user searches for "Road", only "Road - Area 1" and "Road - Area 2" will display with the word "Road" highlighted in yellow.
The code needs to pick up the final project as well.
I seem to have figured it out.
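In the original description, I was using the edge_ngram tokenizer when I should have been using the ngram tokenizer. The edge_ngram tokenizer only emits n-grams anchored at the beginning of each token, and because no token_chars are configured here, the whole name is treated as a single token, so "Road" was only indexed when the name started with it.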
Found on: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html#_partial_word_tokenizers
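The difference between the two tokenizers can be illustrated with a short Python sketch. It assumes, as in the settings above, `min_gram` 3, `max_gram` 10 and no `token_chars`, so the whole name is one token; the function names are hypothetical and the code only imitates the tokenizers.

```python
def edge_ngram_tokens(text, min_gram=3, max_gram=10):
    """Grams anchored at the start of the text only."""
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


def ngram_tokens(text, min_gram=3, max_gram=10):
    """Grams starting at every position in the text."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    ]


# Already lowercased, as the lowercase filter would do.
name = "sub-area 5 - road"

print("road" in edge_ngram_tokens(name))       # False
print("road" in ngram_tokens(name))            # True
print(edge_ngram_tokens("road - area 1")[:2])  # ['roa', 'road']
```

With edge n-grams, "road" is only a term when the name begins with it, which matches the behaviour described in the question.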

Google style autocomplete & autocorrection with elasticsearch

I'm trying to achieve Google-style autocomplete and autocorrection with Elasticsearch.
Mappings :
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"search_analyzer": "simple",
"payloads": true
}
}
}
}
}
Docs:
POST music/song
{
"song_field" : "beautiful queen",
"suggest" : "beautiful queen"
}
POST music/song
{
"song_field" : "beautiful",
"suggest" : "beautiful"
}
I expect that when the user types "beaatiful q", they will get something like "beautiful queen" (beaatiful is corrected to beautiful and q is completed to queen).
I've tried the following query:
POST music/song/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatiful q",
"completion": {
"field": "suggest"
}
}
},
"query": {
"match": {
"song_field": {
"query": "beaatiful q",
"fuzziness": 2
}
}
}
}
Unfortunately, the completion suggester doesn't allow any typos, so I get this response:
"suggest": {
"didYouMean": [
{
"text": "beaatiful q",
"offset": 0,
"length": 11,
"options": []
}
]
}
In addition, the search gave me these results ("beautiful" is ranked higher although the user had started to write "queen"):
"hits": [
{
"_index": "music",
"_type": "song",
"_id": "AVUj4Y5NancUpEdFLeLo",
"_score": 0.51315063,
"_source": {
"song_field": "beautiful",
"suggest": "beautiful"
}
},
{
"_index": "music",
"_type": "song",
"_id": "AVUj4XFAancUpEdFLeLn",
"_score": 0.32071912,
"_source": {
"song_field": "beautiful queen",
"suggest": "beautiful queen"
}
}
]
UPDATE:
I found out that I can use a fuzzy query with the completion suggester, but now I get no suggestions when querying (fuzzy only supports an edit distance of up to 2):
POST music/song/_search
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatefal q",
"completion": {
"field": "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}
}
I still expect "beautiful queen" as the suggestion response.
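One possible reason for the empty result, offered as an assumption rather than a confirmed diagnosis, is that the second test string is simply further than two edits away from "beautiful". The sketch below computes a plain Levenshtein distance in Python; the function is hypothetical, and Elasticsearch's own fuzzy matching may additionally count a swap of adjacent characters as a single edit, which makes no difference for these inputs.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("beaatiful", "beautiful"))  # 1
print(levenshtein("beaatefal", "beautiful"))  # 3
```

"beaatiful" is one substitution away, which is within a fuzziness of 2, while "beaatefal" needs three substitutions and therefore falls outside it.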
When you want to provide two or more words as search suggestions, I have found out (the hard way) that it's not worth using ngrams or edge ngrams in Elasticsearch.
Using the shingle token filter in a shingle analyzer will provide you with multi-word phrases, and if you couple that with match_phrase_prefix it should give you the functionality you're looking for.
Basically something like this:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
And don't forget to do your mapping:
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
Ngrams and edge ngrams tokenize at the level of individual characters, whereas the shingle filter and analyzer group whole words together, which provides a much more efficient way of producing and searching for phrases. I spent a lot of time messing with the two above until I saw shingles mentioned and read up on them. Much better.
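As a rough illustration of the efficiency point, the following Python sketch counts the terms each approach produces for "beautiful queen". It assumes the nGram filter settings from the question (`min_gram` 2, `max_gram` 20, whitespace tokenizer) and the two-word shingle filter from this answer; both helper functions are hypothetical imitations, not Elasticsearch code.

```python
def ngram_filter(word, min_gram=2, max_gram=20):
    """All substrings of a word with length between min_gram and max_gram."""
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]


def bigram_shingles(words):
    """Two-word shingles, as with min_shingle_size = max_shingle_size = 2."""
    return [" ".join(pair) for pair in zip(words, words[1:])]


words = "beautiful queen".split()

ngram_terms = [gram for word in words for gram in ngram_filter(word)]
shingle_terms = bigram_shingles(words)

print(len(ngram_terms))  # 46
print(shingle_terms)     # ['beautiful queen']
```

The n-gram approach emits 46 character-level terms for this one title, while the shingle filter emits a single word-level phrase.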

Get suggestion on field Elasticsearch

I am trying to build a suggestion feature with Elasticsearch, following this article: https://qbox.io/blog/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
What I have now works, but not for two words in the same sentence.
This is the data I currently have in ES:
{
"_index": "books",
"_type": "book",
"_id": "AVJp8p4ZTfM-Ee45GnF5",
"_score": 1,
"_source": {
"title": "Making a dish",
"author": "Jim haunter"
}
},
{
"_index": "books",
"_type": "book",
"_id": "AVJp8jaZTfM-Ee45GnF4",
"_score": 1,
"_source": {
"title": "The big fish",
"author": "Jane Stewart"
}
},
{
"_index": "books",
"_type": "book",
"_id": "AVJp8clRTfM-Ee45GnF3",
"_score": 1,
"_source": {
"title": "The Hunter",
"author": "Jame Franco"
}
}
Here is the mapping and settings.
{"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"books": {
"_all": {
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"title": {
"type": "string",
"index": "no"
},
"author": {
"type": "string",
"index": "no"
}
}
}
}
}
Here is the search
{
"size": 10,
"query": {
"match": {
"_all": {
"query": "Hunter",
"operator": "and",
"fuzziness": 1
}
}
}
}
When I search for "The" I get "The big fish" and "The Hunter".
However, when I enter "The Hunt" I get nothing.
To get the book again I need to enter "The Hunte".
Any suggestions?
Any help is appreciated.
Removing "index": "no" from the fields worked for me. Also, since I'm using ES 2.x, I had to replace "index_analyzer" with "analyzer". So here is the mapping:
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"books": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"title": {
"type": "string"
},
"author": {
"type": "string"
}
}
}
}
}
Here's some code I used to test it:
http://sense.qbox.io/gist/0140ee0f5043f66e76cc3109a18d573c1d09280b
