Return only exact matches (substrings) in full text search (Elasticsearch)

I have an index in elasticsearch with a 'title' field (analyzed string field). If I have the following documents indexed:
{title: "Joe Dirt"}
{title: "Meet Joe Black"}
{title: "Tomorrow Never Dies"}
and the search query is "I want to watch the movie Joe Dirt tomorrow"
I want to find results where the full title matches as a substring of the search query. With a plain match query, all of these documents are returned because each matches at least one of the words. I really just want "Joe Dirt", because that title is an exact substring of the search query.
Is that possible in Elasticsearch?
Thanks!

One way to achieve this is as follows:
1) While indexing, analyze the title with the keyword tokenizer.
2) While searching, use the shingle token filter to extract substrings from the query string and match them against the title.
Example:
Index Settings
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "substring"]
        },
        "exact": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "substring": {
          "type": "shingle",
          "output_unigrams": true
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "exact"
            }
          }
        }
      }
    }
  }
}
Index Documents
PUT test/movie/1
{"title": "Joe Dirt"}
PUT test/movie/2
{"title": "Meet Joe Black"}
PUT test/movie/3
{"title": "Tomorrow Never Dies"}
Query
POST test/_search
{
  "query": {
    "match": {
      "title.raw": {
        "analyzer": "substring",
        "query": "Joe Dirt tomorrow"
      }
    }
  }
}
Result:
"hits": {
  "total": 1,
  "max_score": 0.015511602,
  "hits": [
    {
      "_index": "test",
      "_type": "movie",
      "_id": "1",
      "_score": 0.015511602,
      "_source": {
        "title": "Joe Dirt"
      }
    }
  ]
}
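The idea behind the shingle approach can be sketched in plain Python (a rough simulation, not the actual Lucene implementation, and with an unbounded shingle size rather than the default max of 2): the search-side analyzer emits every contiguous run of query words, and a document matches when its keyword-tokenized title equals one of those shingles.

```python
# Rough simulation of the search-side analysis: the shingle filter
# emits every contiguous run of query words; a document matches when
# its keyword-tokenized (whole-string, lowercased) title equals one.
def shingles(text, max_size=None):
    words = text.lower().split()
    max_size = max_size or len(words)
    out = []
    for size in range(1, max_size + 1):
        for i in range(len(words) - size + 1):
            out.append(" ".join(words[i:i + size]))
    return out

query = "I want to watch the movie Joe Dirt tomorrow"
titles = ["Joe Dirt", "Meet Joe Black", "Tomorrow Never Dies"]
matches = [t for t in titles if t.lower() in shingles(query)]
print(matches)  # ['Joe Dirt'] - the only full-title substring
```

"Joe Dirt" appears as the two-word shingle "joe dirt"; the other titles never occur contiguously in the query, so only the exact-substring title survives.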


How to find word 'food2u' by search 'food' in Elasticsearch?

I am a rookie who just started learning Elasticsearch, and I want to find a word like 'food2u' by searching for the keyword 'food'. But I can only get results like 'Food Repo', 'Give Food', etc. The field's mapping is 'text', and this is my query:
GET api/_search
{
  "query": {
    "match": {
      "Name": {
        "query": "food"
      }
    }
  },
  "_source": {
    "includes": ["Name"]
  }
}
You are getting results like 'Food Repo' and 'Give Food' because a text field uses the standard analyzer if no analyzer is specified. 'Food Repo' gets tokenized into food and repo; similarly, 'Give Food' gets tokenized into give and food.
But food2u gets tokenized into the single token food2u. Since there is no token matching "food", the food2u document is not returned.
You need to use the edge_ngram tokenizer to get partial text matches.
Below is a working example with index data, mapping, search query, and search result.
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
  "name": "food2u"
}
Search Query:
{
  "query": {
    "match": {
      "name": "food"
    }
  }
}
Search Result:
"hits": [
  {
    "_index": "67552800",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.2876821,
    "_source": {
      "name": "food2u"
    }
  }
]
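What the edge_ngram tokenizer does to food2u can be sketched in a few lines of Python (a simplified simulation using the min_gram 4 and max_gram 10 from the mapping above): every prefix of the word within the gram bounds becomes an indexed token, so the query token "food" finds an exact match.

```python
# Sketch of edge n-gram tokenization: emit every prefix of the word
# whose length lies between min_gram and max_gram. A search for
# "food" then matches "food2u" because "food" is an indexed token.
def edge_ngrams(word, min_gram=4, max_gram=10):
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

tokens = edge_ngrams("food2u")
print(tokens)            # ['food', 'food2', 'food2u']
print("food" in tokens)  # True
```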
If you don't want to change the mapping, you can instead use a wildcard query to return the matching documents:
{
  "query": {
    "wildcard": {
      "Name": {
        "value": "food*"
      }
    }
  }
}
Or you can use query_string with a wildcard:
{
  "query": {
    "query_string": {
      "query": "food*",
      "fields": ["Name"]
    }
  }
}

Space handling in Elastic Search

If a document (say, a merchant name) that I am searching for has no space in it, and the user searches for it with a space added, the result won't show up in Elasticsearch. How can that be improved?
For example:
The merchant name is "DeliBites".
If the user types "Deli Bites", the merchant does not appear in the results. The merchant only appears in suggestions when I have typed just "Deli", or "Deli" followed by a space, or "Deli.".
Adding another option: you can also use an n-gram token filter, which works in most cases and is simple to set up and use.
Working example on your data
Index definition
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    },
    "index.max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Index sample doc
{
  "title": "DeliBites"
}
Search query
{
  "query": {
    "match": {
      "title": {
        "query": "Deli Bites"
      }
    }
  }
}
And search results
"hits": [
  {
    "_index": "65489013",
    "_type": "_doc",
    "_id": "1",
    "_score": 0.95894027,
    "_source": {
      "title": "DeliBites"
    }
  }
]
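Why "Deli Bites" matches "DeliBites" can be sketched in Python (a simplified simulation of the ngram token filter above, min_gram 1 and max_gram 10): every substring of the lowercased title is indexed, and the standard search analyzer turns the query into the tokens deli and bites, both of which are among those substrings.

```python
# Sketch of the ngram token filter: every substring of the lowercased
# title (lengths min_gram..max_gram) is indexed, so the search tokens
# "deli" and "bites" both hit the term set produced for "DeliBites".
def ngrams(word, min_gram=1, max_gram=10):
    return {word[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(word) - n + 1)}

indexed = ngrams("delibites")
search_tokens = "Deli Bites".lower().split()  # standard search analyzer
print(all(tok in indexed for tok in search_tokens))  # True
```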
I suggest using the synonym token filter.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
You should have a dictionary of all the words you want to search for.
Something like this:
DeliBites => Deli Bites
ipod => i pod
Before implementing synonyms, be sure you understand all aspects of them.
https://www.elastic.co/blog/boosting-the-power-of-elasticsearch-with-synonyms

Elasticsearch - Get unsorted data while retrieving results using "docvalue_fields"

I'm trying to retrieve data using docvalue_fields from a text field. Below is the mapping:
PUT dvtest
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nl-tokenizer": {
            "type": "simple_pattern_split",
            "pattern": "\n|\r\n|\r|\n\r"
          }
        },
        "analyzer": {
          "newline": {
            "type": "custom",
            "filter": ["trim"],
            "tokenizer": "nl-tokenizer"
          }
        }
      }
    }
  },
  "mappings": {
    "indexMapping": {
      "properties": {
        "data": {
          "norms": false,
          "type": "text",
          "analyzer": "newline",
          "fielddata": true
        }
      }
    }
  }
}
Query to put sample data in index:
POST dvtest/indexMapping
{
"data":"""
The quick brown fox
Jumps over the lazy dog
"""
}
Query I'm using to retrieve data:
POST dvtest/indexMapping/_search
{
  "_source": false,
  "docvalue_fields": ["data"]
}
Now, when I try to retrieve data using the above query, I get the result shown below. The tokens in the data field are sorted in alphabetical order, but I want to retrieve them in the order in which they were indexed.
"hits": {
  "total": 1,
  "max_score": 1,
  "hits": [
    {
      "_index": "dvtest",
      "_type": "indexMapping",
      "_id": "Kp5jNmQBrlmAX-4PVbm9",
      "_score": 1,
      "fields": {
        "data": [
          "Jumps over the lazy dog",
          "The quick brown fox"
        ]
      }
    }
  ]
}
So, my desired result is:
...
"fields": {
  "data": [
    "The quick brown fox",
    "Jumps over the lazy dog"
  ]
}
...
I have tried searching Google for help and also checked GitHub, but couldn't find anything useful.
Thanks in advance for any help! :)
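For what it's worth, the alphabetical order is expected rather than a bug: fielddata (like doc values) stores each document's terms as a sorted, deduplicated set, so the original token order is lost at index time. A minimal Python sketch of the effect:

```python
# Fielddata / doc values keep each document's terms as a sorted,
# deduplicated set, so retrieval returns the lines alphabetically,
# not in the order they appeared in the source text.
tokens = ["The quick brown fox", "Jumps over the lazy dog"]  # indexing order
stored = sorted(set(tokens))  # what fielddata effectively hands back
print(stored)  # ['Jumps over the lazy dog', 'The quick brown fox']
```

To preserve the original line order, you would likely need to read it back from _source (or a stored field) rather than from fielddata.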

Understanding Elasticsearch synonym

Being very new to Elasticsearch, I'm not sure of the best way to use synonyms.
I have two fields: one is hashtag and the other is name. Hashtag contains names in lower case without whitespace, whereas name contains the actual name in camel case.
I want to search by name in its proper format and get all matching names, along with the docs where the hashtag matches as well.
For example, name contains "Tom Cruise" and hashtag is "tomcruise". I want to search for "Tom Cruise" and expect all docs that have either the name "Tom Cruise" or the hashtag "tomcruise".
Here is the way I'm creating this index:
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "ignore_case": true,
          "synonyms": [
            "tom cruise => tomcruise, tom cruise"
          ]
        }
      },
      "analyzer": {
        "synonym": {
          "tokenizer": "whitespace",
          "filter": ["synonym"]
        }
      }
    }
  }
}
PUT /my_index/my_type/_mapping
{
  "my_type": {
    "properties": {
      "hashtag": {
        "type": "string",
        "search_analyzer": "synonym",
        "analyzer": "standard"
      },
      "name": {
        "type": "keyword"
      }
    }
  }
}
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "hashtag": "tomcruise", "name": "abc" }
{ "index": { "_id": 2 }}
{ "hashtag": "tomhanks", "name": "efg" }
{ "index": { "_id": 3 }}
{ "hashtag": "tomcruise" , "name": "efg" }
{ "index": { "_id": 4 }}
{ "hashtag": "news" , "name": "Tom Cruise"}
{ "index": { "_id": 5 }}
{ "hashtag": "celebrity", "name": "Kate Winslet" }
{ "index": { "_id": 6 }}
{ "hashtag": "celebrity", "name": "Tom Cruise" }
When I run _analyze, it looks like I get the right tokens: [tomcruise, tom, cruise]
GET /my_index/_analyze
{
  "text": "Tom Cruise",
  "analyzer": "synonym"
}
Here's how I'm searching:
POST /my_index/my_type/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "Tom Cruise",
      "fields": ["hashtag", "name"]
    }
  }
}
Is this the right way to achieve my search requirement?
What's the best way to run this kind of search in Kibana? Right now I have to type the entire query; what do I need to do if I just want to type "Tom Cruise" and get the expected result? I tried "_all" but it didn't work.
Updated:
After discussing with Russ Cam, and given my limited knowledge of Elasticsearch, I decided that synonyms would be overkill for my search requirement. So I changed the search analyzer to generate the same tokens and got the same result. I still want to know whether I'm doing this the right way.
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "word_joiner": {
          "type": "word_delimiter",
          "catenate_all": true
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "word_joiner"]
        }
      }
    }
  }
}
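The effect of that analyzer chain on "Tom Cruise" can be simulated in Python (a rough sketch, not the real filter): the keyword tokenizer keeps the whole input as one token, lowercase turns it into tom cruise, and word_delimiter with catenate_all both splits on the space and emits the concatenated form, which is what lets one query hit both the name and the hashtag.

```python
import re

# Rough simulation of: keyword tokenizer -> lowercase -> word_delimiter
# (catenate_all). The split parts match name-style text, while the
# concatenated token matches hashtags like "tomcruise".
def analyze(text):
    token = text.lower()                    # keyword tokenizer + lowercase
    parts = [p for p in re.split(r"[^0-9a-z]+", token) if p]  # delimiter split
    if len(parts) > 1:                      # catenate_all: emit joined form too
        parts.append("".join(parts))
    return parts

print(analyze("Tom Cruise"))  # ['tom', 'cruise', 'tomcruise']
```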

Elasticsearch: How to rank first-appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents get the same score. I want the one where "casa" appears earlier (i.e., document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. I am also using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
  "bool": {
    "must": {
      "match": { "name": "cas" }
    },
    "should": {
      "prefix": { "name": "cas" }
    }
  }
}
I'm assuming the values you gave are in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
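The must/should scoring idea can be illustrated outside Elasticsearch with a toy Python model (the numbers are arbitrary, not real Lucene scores): every matching doc earns the base "must" score, and only the prefix match collects the "should" bonus, which is what reorders the results.

```python
# Toy illustration of the bool query's scoring: every matching doc
# gets the "must" score; docs whose name starts with the query also
# get a "should" bonus, ranking them first.
def score(name, query, must_score=1.0, should_bonus=0.5):
    if query not in name.lower():
        return None                      # fails the must clause: excluded
    s = must_score                       # must clause: same base for all hits
    if name.lower().startswith(query):   # should clause: prefix bonus
        s += should_bonus
    return s

docs = ["Casa Road", "Jalan Casa"]
ranked = sorted(docs, key=lambda d: score(d, "cas"), reverse=True)
print(ranked)  # ['Casa Road', 'Jalan Casa']
```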
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and also the number of terms in the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need: a mapping that uses term_vector set to with_positions, an edgeNGram analyzer, and a sub-field of type token_count:
PUT /test
{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions",
          "index_analyzer": "edgengram_analyzer",
          "search_analyzer": "keyword",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "name_ngrams": {
          "min_gram": "2",
          "type": "edgeNGram",
          "max_gram": "30"
        }
      },
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "filter": ["standard", "lowercase", "name_ngrams"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "query": {
              "term": {
                "text": {
                  "value": "cas"
                }
              }
            },
            "script_score": {
              "script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
            },
            "boost_mode": "sum"
          }
        }
      ]
    }
  }
}
and the results:
"hits": {
  "total": 2,
  "max_score": 1.3715843,
  "hits": [
    {
      "_index": "test",
      "_type": "test",
      "_id": "1",
      "_score": 1.3715843,
      "_source": {
        "text": "Casa Road"
      }
    },
    {
      "_index": "test",
      "_type": "test",
      "_id": "2",
      "_score": 0.8715843,
      "_source": {
        "text": "Jalan Casa"
      }
    }
  ]
}
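The script's contribution can be checked by hand. With two-word titles, the formula (wordCount - position) / wordCount adds 1.0 for a first-position match and 0.5 for a second-position match, which is exactly the 0.5 gap between the two scores above. A Python sketch of the formula (using the full word "casa" in place of the edge n-gram term for simplicity):

```python
# The script_score formula: (wordCount - position) / wordCount.
# A term that appears earlier in the text earns a larger bonus.
def position_bonus(words, term):
    words = [w.lower() for w in words]
    pos = words.index(term)  # position of the matching term
    return (len(words) - pos) / len(words)

print(position_bonus("Casa Road".split(), "casa"))   # 1.0
print(position_bonus("Jalan Casa".split(), "casa"))  # 0.5
```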
