Elastic/Kibana: support for plurals in query searches - elasticsearch

I'll simplify my issue. Let's say I have an index with 3 documents I've created with Kibana:
PUT /test/vendors/1
{
"type": "doctor",
"name": "Phil",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New York"
}
]
}
PUT /test/vendors/2
{
"type": "lawyer",
"name": "John",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New Jersey"
}
]
}
PUT /test/vendors/3
{
"type": "doctor",
"name": "Jill",
"works_in": [
{
"place": "Chicago"
}
]
}
Now I'm running a search:
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in chicago",
"fields": [ "type", "place" ]
}
}
}
And I'm getting a good response:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.2876821,
"hits": [
{
"_index": "test",
"_type": "vendors",
"_id": "1",
"_score": 0.2876821,
"_source": {
"type": "doctor",
"name": "Phil",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New York"
}
]
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "3",
"_score": 0.2876821,
"_source": {
"type": "doctor",
"name": "Jill",
"works_in": [
{
"place": "Chicago"
}
]
}
}
]
}
}
Now things begin to get problematic...
I changed doctor to doctors:
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctors in chicago",
"fields": [ "type", "place" ]
}
}
}
Zero results, since doctors is not found: out of the box, Elasticsearch doesn't know about plural vs. singular.
Change the query to New York
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in new york",
"fields": [ "type", "place" ]
}
}
}
But the result set gives me the doctor in Chicago in addition to the doctor in New York; the fields are matched with OR...
Another interesting question is what happens if someone uses docs or physicians or health professionals but means doctor. Is there a provision where I can teach Elasticsearch to funnel those into "doctor"?
Is there any pattern or way around these using Elasticsearch alone, where I won't have to analyze the string for meaning in my own application and then construct a complex, exact Elasticsearch query to match it?
I would appreciate any pointer in the right direction.

I'm assuming that the fields type and place are of text type with standard analyzers.
To manage singular/plural forms, what you are looking for is the Snowball Token Filter, which you would need to add to the mapping.
For the other requirement you've mentioned, e.g. that physicians should be equated with doctor, you need to make use of the Synonym Token Filter.
Below is how your mapping should look. Note that I've added the analyzer to type and place; you can make similar changes to the mapping for any other fields.
Mapping
PUT <your_index_name>
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_snow",
"my_synonym"
]
}
},
"filter":{
"my_snow":{
"type":"snowball",
"language":"English"
},
"my_synonym":{
"type":"synonym",
"synonyms":[
"docs, physicians, health professionals, doctor"
]
}
}
}
},
"mappings":{
"mydocs":{
"properties":{
"type":{
"type":"text",
"analyzer":"my_analyzer"
},
"place":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
}
Notice how I've added the synonyms in the mapping itself; instead of that, I'd suggest keeping the synonyms in a text file, like below:
{
"type":"synonym",
"synonyms_path" : "analysis/synonym.txt"
}
As the linked documentation mentions, the above configures a synonym filter with a path of analysis/synonym.txt (relative to the config location).
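Conceptually, the analyzer chain above runs lowercase, then stemming, then synonym mapping over each token. Here is a minimal Python sketch of that pipeline; the one-rule stemmer and the synonym table are illustrative stand-ins, not the real Snowball algorithm or your actual synonym set:

```python
# Toy sketch of an analyzer chain: lowercase -> stem -> synonym mapping.
# The stemmer below only strips a trailing "s" (an assumption); Snowball
# implements a full stemming algorithm.
SYNONYMS = {"doc": "doctor", "physician": "doctor"}  # illustrative table

def toy_stem(token):
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def analyze(text):
    tokens = [toy_stem(t) for t in text.lower().split()]
    return [SYNONYMS.get(t, t) for t in tokens]

# Query and indexed text pass through the same chain, so "doctors",
# "docs", and "physicians" all meet on the normalized term "doctor".
print(analyze("Doctors in Chicago"))  # ['doctor', 'in', 'chicago']
print(analyze("physicians"))          # ['doctor']
```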
Hope it helps!

Related

Is it possible to boost suggestions based on the Elasticsearch index

I have multiple indices that I want suggestions from, but I want to score/order the suggestions based on the index they're from. I've successfully boosted searches based on indices (using indices_boost), but this doesn't seem to work for suggestions. I tried something like:
GET index1,index2/_search
{
"indices_boost" : [
{ "index1" : 9 },
{ "index2" : 1 }
],
"suggest": {
"mySuggest":{
"text":"someText",
"completion": {
"field":"suggestField",
"size":6
}
}
}
}
Is this doable?
At the moment I've resorted to sorting the suggestions in code.
I believe you can try to use category boost in context suggester to achieve the desired behavior. You need to attach a special category field to each suggestion document, which can be exactly the same as the index name.
How to use category context to boost suggestions
The mapping may look like this:
PUT food
{
"mappings": {
"properties" : {
"suggestField" : {
"type" : "completion",
"contexts": [
{
"name": "index_name",
"type": "category"
}
]
}
}
}
}
For demonstration purposes I will create another index, exactly like the one above but with name movie. (Index names can be arbitrary.)
Let's add the suggest documents:
PUT food/_doc/1
{
"suggestField": {
"input": ["timmy's", "starbucks", "dunkin donuts"],
"contexts": {
"index_name": ["food"]
}
}
}
PUT movie/_doc/2
{
"suggestField": {
"input": ["star wars"],
"contexts": {
"index_name": ["movie"]
}
}
}
Now we can run a suggest query with our boosts set:
POST food,movie/_search
{
"suggest": {
"my_suggestion": {
"prefix": "star",
"completion": {
"field": "suggestField",
"size": 10,
"contexts": {
"index_name": [
{
"context": "movie",
"boost": 9
},
{
"context": "food",
"boost": 1
}
]
}
}
}
}
}
Which will return something like this:
{
"suggest": {
"my_suggestion": [
{
"text": "star",
"offset": 0,
"length": 4,
"options": [
{
"text": "star wars",
"_index": "movie",
"_type": "_doc",
"_id": "2",
"_score": 9.0,
"_source": ...,
"contexts": {
"index_name": [
"movie"
]
}
},
{
"text": "starbucks",
"_index": "food",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": ...,
"contexts": {
"index_name": [
"food"
]
...
Why didn't indices_boost work?
It seems the indices_boost parameter only affects search, not suggest. _suggest used to be a standalone endpoint but was deprecated, and this is probably the source of the confusion.
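In the meantime, the client-side sorting you resorted to can stay small: weight each suggest option by its source index. The index names and weights below are assumptions mirroring the indices_boost values in the question:

```python
# Re-rank suggest options client-side by a per-index weight.
# Index names and weights mirror the question's indices_boost (assumed).
INDEX_BOOST = {"index1": 9, "index2": 1}

def rank_options(options):
    # Each option carries "_index" and "_score", as in a suggest response.
    return sorted(options,
                  key=lambda o: o["_score"] * INDEX_BOOST.get(o["_index"], 1),
                  reverse=True)

options = [
    {"text": "someText A", "_index": "index2", "_score": 1.0},
    {"text": "someText B", "_index": "index1", "_score": 0.5},
]
print([o["text"] for o in rank_options(options)])  # ['someText B', 'someText A']
```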
Hope that helps!

Name searching in ElasticSearch

I have an index created in Elasticsearch with the field name, where I store the whole name of a person: first name and surname. I want to perform full-text search over that field, so I have indexed it using an analyzer.
My issue now is that if I search:
"John Rham Rham"
And the index contains "John Rham Rham Luck", that value gets a higher score than "John Rham Rham".
Is there any possibility to give a better score to the exact match than to a field with more tokens in the string?
Thanks in advance!
I worked out a small example (assuming you're running ES 5.x, because of the difference in scoring):
DELETE test
PUT test
{
"settings": {
"similarity": {
"my_bm25": {
"type": "BM25",
"b": 0
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "text",
"similarity": "my_bm25",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
}
POST test/test/1
{
"name": "John Rham Rham"
}
POST test/test/2
{
"name": "John Rham Rham Luck"
}
GET test/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "John Rham Rham",
"operator": "and"
}
}
},
"functions": [
{
"script_score": {
"script": "_score / doc['name.length'].getValue()"
}
}
]
}
}
}
This code does the following:
Replaces the default BM25 implementation with a custom one, tweaking the b parameter (field-length normalisation).
-- You could also change the similarity to 'classic' to go back to TF/IDF, which doesn't have this normalisation.
Creates an inner field for your name field, which counts the number of tokens inside the name field.
Adjusts the score according to the token count.
This will result in:
"hits": {
"total": 2,
"max_score": 0.3596026,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.3596026,
"_source": {
"name": "John Rham Rham"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.26970196,
"_source": {
"name": "John Rham Rham Luck"
}
}
]
}
}
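The script in the function_score query boils down to dividing each hit's score by its token count, which is why the shorter exact match wins. A sketch of that arithmetic (the input score is made up, not a real BM25 value):

```python
# Sketch of the script_score step: _score / doc['name.length'].
def adjusted_score(match_score, token_count):
    return match_score / token_count

# Same raw match score, different field lengths: the exact name wins.
exact = adjusted_score(1.0786, 3)   # "John Rham Rham"
longer = adjusted_score(1.0786, 4)  # "John Rham Rham Luck"
print(exact > longer)  # True
```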
Not sure if this is the best way of doing it, but maybe it points you in the right direction :)

Cross Field Search with Multiple Complete and Incomplete Phrases in Each Field

Example data:
PUT /test/test/1
{
"text1":"cats meow",
"text2":"12345",
"text3":"toy"
}
PUT /test/test/2
{
"text1":"dog bark",
"text2":"98765",
"text3":"toy"
}
And an example query:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats toy",
"type" : "cross_fields"
}
}
}
Returns the cat hit first and then the dog, which is what I want.
BUT when you query cat toy, the cat and the dog get the same relevance score. I want the prefix of each word (and maybe a few other words inside that field) to be taken into consideration while still running cross_fields.
So if I search:
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "phrase_prefix"
}
}
}
or
GET /test/test/_search
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "meow cats",
"type" : "phrase_prefix"
}
}
}
I should get cat/ID 1, but I did not.
I found that cross_fields handles multi-word phrases, but not multiple incomplete phrases; and phrase_prefix handles an incomplete phrase, but not multiple incomplete phrases...
Sifting through the documentation really isn't helping me discover how to combine the two.
Yeah, I had to apply an analyzer...
The analyzer is applied to the fields when creating the index, before you add any data. I couldn't find an easier way to do this after the data is added.
The solution I found is exploding all of the phrases into their individual prefixes so cross_fields can do its magic. You can learn more about the use of edge_ngram here.
So instead of cross_fields just searching the cats phrase, it's now going to search c, ca, cat, and cats, and every phrase after... So the text1 field will look like this to Elasticsearch: c ca cat cats m me meo meow.
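What the edge_ngram filter does to each token can be sketched in a few lines of Python (using the same min_gram/max_gram values as the index settings in this answer):

```python
# Sketch of an edge_ngram token filter: emit every prefix of each token,
# from min_gram up to max_gram characters.
def edge_ngrams(text, min_gram=1, max_gram=20):
    grams = []
    for token in text.lower().split():
        for n in range(min_gram, min(len(token), max_gram) + 1):
            grams.append(token[:n])
    return grams

print(edge_ngrams("cats meow"))
# ['c', 'ca', 'cat', 'cats', 'm', 'me', 'meo', 'meow']
```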
Here are the steps to make the above question example work:
First you create and name the analyzer. To learn a bit more what the filter's values mean, I recommend you take a look at this.
PUT /test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
Then I attached this analyzer to each field.
I changed text1 to match the field I was applying this to.
PUT /test/_mapping/test
{
"test": {
"properties": {
"text1": {
"type": "string",
"analyzer": "autocomplete"
}
}
}
}
I ran GET /test/_mapping to be sure everything worked.
Then to add the data:
POST /test/test/_bulk
{ "index": { "_id": 1 }}
{ "text1": "cats meow", "text2": "12345", "text3": "toy" }
{ "index": { "_id": 2 }}
{ "text1": "dog bark", "text2": "98765", "text3": "toy" }
And the search!
{
"size": 25,
"query": {
"multi_match" : {
"fields" : [
"text1",
"text2",
"text3"
],
"query" : "cat toy",
"type" : "cross_fields"
}
}
}
Which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.70778143,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.70778143,
"_source": {
"text1": "cats meow",
"text2": "12345",
"text3": "toy"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.1278426,
"_source": {
"text1": "dog bark",
"text2": "98765",
"text3": "toy"
}
}
]
}
}
This creates contrast between the two when you search cat toy, whereas before the scores were the same. Now the cat hit has a higher score, as it should. This is achieved by taking into consideration every prefix (max 20 characters per token in this case) of each phrase and then letting cross_fields judge how relevant the data is.

elasticsearch: How to rank first appearing words or phrases higher

For example, if I have the following documents:
1. Casa Road
2. Jalan Casa
Say my query term is "cas"... on searching, both documents get the same score. I want the one where casa appears earlier (i.e. document 1 here) to rank first in my query output.
I am using an edgeNGram analyzer. Also, I am using aggregations, so I cannot use the normal sorting that happens after querying.
You can use the Bool Query to boost the items that start with the search query:
{
"bool" : {
"must" : {
"match" : { "name" : "cas" }
},
"should": {
"prefix" : { "name" : "cas" }
}
}
}
I'm assuming the values you gave are in the name field, and that that field is not analyzed. If it is analyzed, maybe look at this answer for more ideas.
The way it works is:
Both documents will match the query in the must clause, and will receive the same score for that. A document won't be included if it doesn't match the must query.
Only the document with the term starting with cas will match the query in the should clause, causing it to receive a higher score. A document won't be excluded if it doesn't match the should query.
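The additive effect of must and should can be sketched with toy scoring; the substring/startswith checks and the weights below are crude stand-ins for the real match and prefix queries, not how Lucene actually scores:

```python
# Toy sketch of bool must + should: a must match gives a base score,
# a should match adds a bonus. Weights and string checks are illustrative.
def bool_score(name, query):
    q, text = query.lower(), name.lower()
    must = 1.0 if q in text else 0.0             # stand-in for "match"
    should = 0.5 if text.startswith(q) else 0.0  # stand-in for "prefix"
    return (must + should) if must else 0.0

docs = ["Casa Road", "Jalan Casa"]
ranked = sorted(docs, key=lambda d: bool_score(d, "cas"), reverse=True)
print(ranked)  # ['Casa Road', 'Jalan Casa']
```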
This might be a bit more involved, but it should work.
Basically, you need the position of the term within the text itself and also the number of terms in the text. The actual scoring is computed using scripts, so you need to enable dynamic scripting in the elasticsearch.yml config file:
script.engine.groovy.inline.search: on
This is what you need:
a mapping that uses term_vector set to with_positions, an edgeNGram analyzer, and a sub-field of type token_count:
PUT /test
{
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions",
"index_analyzer": "edgengram_analyzer",
"search_analyzer": "keyword",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "standard"
}
}
}
}
}
},
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"min_gram": "2",
"type": "edgeNGram",
"max_gram": "30"
}
},
"analyzer": {
"edgengram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"name_ngrams"
],
"tokenizer": "standard"
}
}
}
}
}
test documents:
POST /test/test/1
{"text":"Casa Road"}
POST /test/test/2
{"text":"Jalan Casa"}
the query itself:
GET /test/test/_search
{
"query": {
"bool": {
"must": [
{
"function_score": {
"query": {
"term": {
"text": {
"value": "cas"
}
}
},
"script_score": {
"script": "termInfo=_index['text'].get('cas',_POSITIONS);wordCount=doc['text.word_count'].value;if (termInfo) {for(pos in termInfo){return (wordCount-pos.position)/wordCount}};"
},
"boost_mode": "sum"
}
}
]
}
}
}
and the results:
"hits": {
"total": 2,
"max_score": 1.3715843,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1.3715843,
"_source": {
"text": "Casa Road"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.8715843,
"_source": {
"text": "Jalan Casa"
}
}
]
}
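The arithmetic inside the Groovy script, (wordCount - position) / wordCount, is what separates the two scores by exactly 0.5 here. Sketched in Python:

```python
# Sketch of the script_score arithmetic: earlier term positions score higher.
# position is the 0-based index of the matching term; word_count comes from
# the token_count sub-field.
def position_score(position, word_count):
    return (word_count - position) / word_count

print(position_score(0, 2))  # "Casa Road":  casa at position 0 -> 1.0
print(position_score(1, 2))  # "Jalan Casa": casa at position 1 -> 0.5
```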

Elastic Search- Fetch Distinct Tags

I have document of following format:
{
_id :"1",
tags:["guava","apple","mango", "banana", "gulmohar"]
}
{
_id:"2",
tags: ["orange","guava", "mango shakes", "apple pie", "grammar"]
}
{
_id:"3",
tags: ["apple","grapes", "water", "gulmohar","water-melon", "green"]
}
Now, I want to fetch the unique tag values from the tags field across all documents, starting with prefix g*, so that these unique tags can be displayed by a tag suggester (the Stack Overflow site is an example).
For example: Whenever user types, 'g':
"guava", "gulmohar", "grammar", "grapes" and "green" should be returned as a result.
I.e. the query should return distinct tags with prefix g*.
I looked everywhere, browsed the whole documentation, and searched the ES forum, but I didn't find any clue, much to my dismay.
I tried aggregations, but aggregations return the distinct count for every word/token in the tags field. They do not return the unique list of tags starting with 'g'.
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"allow_leading_wildcard": false,
"fields": [
"tags"
],
"query": "g*",
"fuzziness":0
}
}
]
}
},
"filter": {
//some condition on other field...
}
}
},
"aggs": {
"distinct_tags": {
"terms": {
"field": "tags",
"size": 10
}
}
},
The result of the above: guava, apple, mango, ... with their counts; the aggregation returns all tags of the matching documents, not only those starting with g.
Can someone please suggest me the correct way to fetch all the distinct tags with prefix input_prefix*?
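Stated outside Elasticsearch, the goal is just the distinct, prefix-filtered set over all documents' tags:

```python
# The target result, expressed in plain Python: the distinct set of tags
# across all documents that start with the typed prefix.
docs = [
    {"tags": ["guava", "apple", "mango", "banana", "gulmohar"]},
    {"tags": ["orange", "guava", "mango shakes", "apple pie", "grammar"]},
    {"tags": ["apple", "grapes", "water", "gulmohar", "water-melon", "green"]},
]

def distinct_with_prefix(docs, prefix):
    return sorted({t for d in docs for t in d["tags"] if t.startswith(prefix)})

print(distinct_with_prefix(docs, "g"))
# ['grammar', 'grapes', 'green', 'guava', 'gulmohar']
```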
It's a bit of a hack, but this seems to accomplish what you want.
I created an index and added your docs:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"tags":["guava","apple","mango", "banana", "gulmohar"]}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"tags": ["orange","guava", "mango shakes", "apple pie", "grammar"]}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"tags": ["guava","apple","grapes", "water", "grammar","gulmohar","water-melon", "green"]}
Then I used a combination of prefix query and highlighting as follows:
POST /test_index/_search
{
"query": {
"prefix": {
"tags": {
"value": "g"
}
}
},
"fields": [ ],
"highlight": {
"pre_tags": [""],
"post_tags": [""],
"fields": {
"tags": {}
}
}
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"highlight": {
"tags": [
"guava",
"gulmohar"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1,
"highlight": {
"tags": [
"guava",
"grammar"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"highlight": {
"tags": [
"guava",
"grapes",
"grammar",
"gulmohar",
"green"
]
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/c14675ee8bd3934389a6cb0c85ff57621a17bf11
What you're trying to do amounts to autocomplete, of course, and there are perhaps better ways of going about it than the one I posted above (though they are a bit more involved). Here are a couple of blog posts we did about ways to set up autocomplete:
http://blog.qbox.io/quick-and-dirty-autocomplete-with-elasticsearch-completion-suggest
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
As per @Sloan Ahrens' advice, I did the following:
Updated the mapping:
"tags": {
"type": "completion",
"context": {
"filter_color": {
"type": "category",
"default": "",
"path": "fruits.color"
},
"filter_type": {
"type": "category",
"default": "",
"path": "fruits.type"
}
}
}
Reference: ES API Guide
Inserted these indexes:
{
"_id": "1",
"tags": {"input": ["guava", "apple", "mango", "banana", "gulmohar"]},
"fruits": {"color": "bar", "type": "alice"}
}
{
"_id": "2",
"tags": {"input": ["orange", "guava", "mango shakes", "apple pie", "grammar"]},
"fruits": {"color": "foo", "type": "bob"}
}
{
"_id": "3",
"tags": {"input": ["apple", "grapes", "water", "gulmohar", "water-melon", "green"]},
"fruits": {"color": "foo", "type": "alice"}
}
I didn't need to modify my original index much; I just added input before the tags array.
POST rescu1/_suggest?pretty
{
"suggest": {
"text": "g",
"completion": {
"field": "tags",
"size": 10,
"context": {
"filter_color": "bar",
"filter_type": "alice"
}
}
}
}
gave me the desired output.
I accepted @Sloan Ahrens' answer, as his suggestions worked like a charm for me and he showed me the right direction.
