I think I should explain my problem with an example:
Assume that I've created index with synonym analyzer and I declare that "laptop", "phone" and "tablet" are similar words that can be generalized as "mobile":
PUT synonym
{
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"phone, tablet, laptop => mobile"
]
}
}
}
}
},
"mappings": {
"synonym" : {
"properties" : {
"field1" : {
"type" : "text",
"analyzer": "synonym",
"search_analyzer": "synonym"
}
}
}
}
}
Now I am creating some docs:
PUT synonym/synonym/1
{
"field1" : "phone"
}
PUT synonym/synonym/2
{
"field1" : "tablet"
}
PUT synonym/synonym/3
{
"field1" : "laptop"
}
Now when I match query for laptop, tablet or phone, the result is always:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.2876821,
"hits": [
{
"_index": "synonym",
"_type": "synonym",
"_id": "2",
"_score": 0.2876821,
"_source": {
"field1": "tablet"
}
},
{
"_index": "synonym",
"_type": "synonym",
"_id": "1",
"_score": 0.18232156,
"_source": {
"field1": "phone"
}
},
{
"_index": "synonym",
"_type": "synonym",
"_id": "3",
"_score": 0.18232156,
"_source": {
"field1": "laptop"
}
}
]
}
}
You can see that the score of tablet is always higher even when I search for laptop.
I know that is because I declared them as similar words.
However, I am trying to figure out how can I query so that document with the search term can appear in the first place, before the similar words in the result list.
It can be done by boosting, but there must be a simpler approach..
Multi-fields to your rescue.
Index the field1 in two ways, one with the synonym analyzer, and the other with a standard analyzer.
Now you can simply use a bool-should query to add score for match on field1 (synonym) and on field1.raw (standard).
So, your mappings should be like so:
PUT synonym
{
"settings": {
"index": {
"number_of_shards": 3,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"phone, tablet, laptop => mobile"
]
}
}
}
}
},
"mappings": {
"synonym": {
"properties": {
"field1": {
"type": "text",
"analyzer": "synonym",
"search_analyzer": "synonym",
"fields": {
"raw": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
}
}
And you can query using:
GET synonyms/_search?search_type=dfs_query_then_fetch
{
"query": {
"bool": {
"should": [
{
"match": {
"field1": "tablet"
}
},
{
"match": {
"field1.raw": "tablet"
}
}
]
}
}
}
Notice: I've used search_type=dfs_query_then_fetch. Since you're testing on 3 shards and have very few documents, the scores you're getting aren't what they should be. This is because the frequencies are calculated per shard. You can use dfs_query_then_fetch while testing but it is discouraged for production. See: https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch
Related
Here's a simplification of what I have:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
PUT my_index/_doc/1
{
"title": "Quick Foxes"
}
PUT my_index/_doc/2
{
"title": "Quick Fuxes"
}
PUT my_index/_doc/3
{
"title": "Foxes Quick"
}
PUT my_index/_doc/4
{
"title": "Foxes Slow"
}
I am trying to search for Quick Fo to test the autocomplete:
GET my_index/_search
{
"query": {
"match": {
"title": {
"query": "Quick Fo",
"operator": "and"
}
}
}
}
The problem is that this query also returns Foxes Quick where I expected 'Quick Foxes'
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5753642,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "Quick Foxes"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "3",
"_score": 0.5753642,
"_source": {
"title": "Foxes Quick" <<<----- WHY???
}
}
]
}
}
What can I tweak so that I can query a classic "autocomplete" where "Quick Fo" surely won't return "Foxes Quick"..... but only "Quick Foxes"?
---- ADDITIONAL INFO -----------------------
This worked for me:
PUT my_index1
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
PUT my_index1/_doc/1
{
"text": "Quick Brown Fox"
}
PUT my_index1/_doc/2
{
"text": "Quick Frown Fox"
}
PUT my_index1/_doc/3
{
"text": "Quick Fragile Fox"
}
GET my_index1/_search
{
"query": {
"match": {
"text": {
"query": "quick br",
"operator": "and"
}
}
}
}
Issue is due to your search analyzer autocomplete_search, in which you are using the lowercase tokenizer, so your search term Quick Fo will be divided into 2 terms, quick and fo (note lowercase) and will be matched against the tokens generated using the autocomplete analyzer on your indexed docs.
Now title Foxes Quick uses autocomplete analyzer and will be having both quick and fo tokens, hence it matches with the search term tokens.
you can simply use the _analyzer API, to check the tokens generated for your documents and as well as for your search term, to understand it better.
Please refer official ES doc https://www.elastic.co/guide/en/elasticsearch/guide/master/_index_time_search_as_you_type.html on how to implement the autocomplete, they also use different search time analyzer, but there is a certain limitation to it and can't solve all the use-cases(esp. if you have docs like yours), hence I implemented it using some other design, which is based on the business requirements.
Hope I was clear on explaining why it's returning the second doc in your case.
EDIT: Also in your case, IMO Match phrase prefix would be more useful.
I have a simple field of type "text" in my index.
"keywordName": {
"type": "text"
}
And I have these documents already inserted : "samsung", "samsung galaxy", "samsung cover", "samsung charger".
If I make a simple "match" query, the results are disturbing:
Query:
GET keywords/_search
{
"query": {
"match": {
"keywordName": "samsung"
}
}
}
Results:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.113083,
"hits": [
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung galaxy",
"_score": 1.113083,
"_source": {
"keywordName": "samsung galaxy"
}
},
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung charger",
"_score": 0.9433406,
"_source": {
"keywordName": "samsung charger"
}
},
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung",
"_score": 0.8405092,
"_source": {
"keywordName": "samsung"
}
},
{
"_index": "keywords",
"_type": "keyword",
"_id": "samsung cover",
"_score": 0.58279467,
"_source": {
"keywordName": "samsung cover"
}
}
]
}
}
First Question : Why "samsung" has not the highest score?
Second Question : How can I make a query or analyser which gives me "samsung" as highest score?
Starting from the same index settings (analyzers, filters, mappings) as in my previous reply, I suggest the following solution. But, as I mentioned, you need to lay down all the requirements in terms of what you need to search for in this index and consider all of this as a complete solution.
DELETE test
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"custom_stop": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_stop",
"my_snow",
"asciifolding"
]
}
},
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_french_"
},
"my_snow": {
"type": "snowball",
"language": "French"
}
}
}
},
"mappings": {
"test": {
"properties": {
"keywordName": {
"type": "text",
"analyzer": "custom_stop",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
POST /test/test/_bulk
{"index":{}}
{"keywordName":"samsung galaxy"}
{"index":{}}
{"keywordName":"samsung charger"}
{"index":{}}
{"keywordName":"samsung cover"}
{"index":{}}
{"keywordName":"samsung"}
GET /test/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"keywordName": {
"query": "samsungs",
"operator": "and"
}
}
},
{
"term": {
"keywordName.raw": {
"value": "samsungs"
}
}
},
{
"fuzzy": {
"keywordName.raw": {
"value": "samsungs",
"fuzziness": 1
}
}
}
]
}
},
"size": 10
}
Here are my settings:
{
"countries": {
"aliases": {},
"mappings": {
"country": {
"properties": {
"countryName": {
"type": "string"
}
}
}
},
"settings": {
"index": {
"creation_date": "1472140045116",
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
}
},
"analyzer": {
"synonym": {
"filter": [
"synonym"
],
"tokenizer": "whitespace"
}
}
},
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "7-fKyD9aR2eG3BwUNdadXA",
"version": {
"created": "2030599"
}
}
},
"warmers": {}
}
}
My synonym.txt file is in the config folder inside the main elasticsearch folder.
Here is my query:
query: {
query_string: {
fields: ["countryName"],
default_operator: "AND",
query: searchInput,
analyzer: "synonym"
}
}
The words in synonym.txt are: us, u.s., united states.
So this doesn't work. What's interesting is that search works as normal, except for when I enter any of the words in the synonym.txt file. So for example, when I usually type in us into the search, I would get results. With this analyzer, us doesn't give me anything.
I've done close and open to my ES server, and still it doesn't work.
EDIT
An example of a document:
{
"_index": "countries",
"_type": "country",
"_id": "57aabeb80057405968de152b",
"_score": 1,
"_source": {
"countryName": "United States"
}
Example of searchInput (this is coming from the front-end):
united states
EDIT #2:
Here is my updated index config file:
{
"countries": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"number_of_shards": "5",
"creation_date": "1472219634083",
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
}
},
"analyzer": {
"synonym": {
"filter": [
"synonym"
],
"tokenizer": "whitespace"
}
}
},
"country": {
"properties": {
"countryName": {
"type": "string",
"analyzer": "synonym"
},
"number_of_replicas": "1",
"uuid": "50ZwpIVFTqeD_rJxlmd59Q",
"version": {
"created": "2030599"
}
}
},
"warmers": {}
}
}
}
}
When I try adding documents, and doing a search on said documents, the synonym analyzer does not work for me.
EDIT #3
Here are 2 documents in the index:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [{
"_index": "stocks",
"_type": "stock",
"_id": "2",
"_score": 1,
"_source": {
"countryName": "United States"
}
}, {
"_index": "stocks",
"_type": "stock",
"_id": "1",
"_score": 1,
"_source": {
"countryName": "Canada"
}
}]
}
}
You are close, but I suggest reading thoroughly this section from the documentation to understand better this functionality.
As a solution:
PUT /countries
{
"mappings": {
"country": {
"properties": {
"countryName": {
"type": "string",
"analyzer": "synonym"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms_path": "synonym.txt"
}
},
"analyzer": {
"synonym": {
"filter": [
"lowercase",
"synonym"
],
"tokenizer": "whitespace"
}
}
}
}
}
You need to delete the index and create it again with the mapping above.
Then use this query:
"query": {
"query_string": {
"fields": [
"countryName"
],
"default_operator": "AND",
"query": "united states"
}
}
Have you deleted/created the index after pushing the txt ?
I think you should remove the "synonyms": "" if you are using "synonyms_path"
I have the following elastic search configuration:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
},
"snow_filter" : {
"type" : "snowball",
"language" : "English"
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"snow_filter",
"autocomplete_filter"
]
}
}
}
}
}
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "snowball"
},
"not": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
{ "index": { "_id": 3 }}
{ "name": "my discovery" }
{ "index": { "_id": 4 }}
{ "name": "myself is fun" }
{ "index": { "_id": 5 }}
{ "name": ["foxy", "foo"] }
{ "index": { "_id": 6 }}
{ "name": ["foo bar", "baz"] }
I am trying to get a search to only return item 6 that has a name of "foo bar" and I am not quite sure how. This is what I am doing right now:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b"
}
}
}
}
I know it's a combination of how the tokenizer is splitting the word but sort of lost on how both be flexible and be strict enough to match this. I am guessing I need to do a multiple field on my mapping of name, but I am not sure. How can I fix the query and/or my mapping to satisfy my needs?
You're already close. Since your edge_ngram analyzer generates tokens of a minimum length of 1, and your query gets tokenized into "foo" and "b", and the default match query operator is "or", your query matches each document that has a term starting with "b" (or "foo"), three of the docs.
Using the "and" operator seems to do what you want:
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b",
"operator": "and"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4451914,
"hits": [
{
"_index": "test_index",
"_type": "my_type",
"_id": "6",
"_score": 1.4451914,
"_source": {
"name": [
"foo bar",
"baz"
]
}
}
]
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/4f6fb7c1fdc6942023091ee1433d7490e04e7dea
I'm trying to do a simple query to my elasticsearch _type and match multiple fields with wildcards, my first attempt was like this:
POST my_index/my_type/_search
{
"sort" : { "date_field" : {"order" : "desc"}},
"query" : {
"filtered" : {
"filter" : {
"or" : [
{
"term" : { "field1" : "4848" }
},
{
"term" : { "field2" : "6867" }
}
]
}
}
}
}
This example will successfully match every record when field1 OR field2 are exactly equal to 4848 and 6867 respectively.
What I'm trying to do is to match on field1 any text that contains 4848 and field2 that contains 6867 but I'm not really sure how to do it.
I appreciate any help I can get :)
It sounds like your problem has mostly to do with analysis. The appropriate solution depends on the structure of your data and what you want to match. I'll provide a couple of examples.
First, let's assume that your data is such that we can get what we want just using the standard analyzer. This analyzer will tokenize text fields on whitespace, punctuation and symbols. So the text "1234-5678-90" will be broken into the terms "1234", "5678", and "90", so a "term" query or filter for any of those terms will match that document. More concretely:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"analyzer": "standard"
},
"field2":{
"type": "string",
"analyzer": "standard"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "1212-2323-4848","field2": "1234-5678-90"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "0000-0000-0000","field2": "0987-6543-21"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "1111-2222-3333","field2": "6867-4545-90"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "1212-2323-4848",
"field2": "1234-5678-90"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "1111-2222-3333",
"field2": "6867-4545-90"
}
}
]
}
}
(Explicitly writing "analyzer": "standard" is redundant since that is the default analyzer used if you do not specify one; I just wanted to make it obvious.)
On the other hand, if the text is embedded in such a way that the standard analysis doesn't provide what you want, say something like "121223234848" and you want to match on "4848", you will have to do something little more sophisticated, using ngrams. Here is an example of that (notice the difference in the data):
DELETE /test_index
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"field2":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "121223234848","field2": "1234567890"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "000000000000","field2": "0987654321"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "111122223333","field2": "6867454590"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "121223234848",
"field2": "1234567890"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "111122223333",
"field2": "6867454590"
}
}
]
}
}
There is a lot going on here, so I won't attempt to explain it in this post. If you want more explanation I would encourage you to read this blog post: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. Hope you'll forgive the shameless plug. ;)
Hope that helps.