I've read a lot, and it seems that EdgeNGrams are a good way to implement an autocomplete feature for search applications. I've already configured EdgeNGrams in the settings for my index.
PUT /bigtestindex
{
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "stop", "kstem", "ngram" ]
}
},
"filter":{
"edgengram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"fields": {
"title.autocomplete": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
}
}
So if I have the EdgeNGram filter configured in my settings, how do I add that to the search query?
What I have so far is a match query with highlight:
GET /bigtestindex/doc/_search
{
"query": {
"match": {
"content": {
"query": "thing and another thing",
"operator": "and"
}
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"field": {
"_source.content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
How would I add autocomplete to the search query using EdgeNGrams configured in the settings for the index?
UPDATE
For the mapping, would it be ideal to do something like this:
"title": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
},
Or do I need to use the multi_field type:
"title": {
"type": "multi_field",
"fields": {
"title": {
"type": "string"
},
"autocomplete": {
"analyzer": "autocomplete",
"type": "string",
"index": "not_analyzed"
}
}
},
I'm using ES 1.4.1 and want to use the title field for autocomplete purposes. Which of these two approaches is right?
Short answer: you need to use it in a field mapping. As in:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"ngram"
]
}
},
"filter": {
"edgengram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
For a bit more discussion, see:
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
and
http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch
Also, I don't think you want the "highlight" section in your index definition; that belongs in the query.
EDIT: Upon trying out your code, I found a couple of problems with it. One was the highlight issue I already mentioned. Another is that you named your filter "edgengram" even though it is of type "ngram" rather than type "edgeNGram", and then referenced the filter "ngram" in your analyzer, which falls back to the default ngram filter and probably doesn't give you what you want. (Hint: you can use term vectors to figure out what your analyzer is doing to your documents; you probably want to turn them off in production, though.)
So what you actually want is probably something like this:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"edgengram_filter"
]
}
},
"filter": {
"edgengram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"content": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
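As a side note related to the term-vectors hint above, the _analyze API is another quick way to check what the autocomplete analyzer emits. A minimal sketch, not part of the original answer; the JSON-body form shown here works on recent Elasticsearch versions, while 1.x passes the analyzer and text as query-string parameters instead:
GET /test_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "goodbye world"
}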
When I indexed these two docs:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"content":"hello world"}
{"index":{"_id":2}}
{"content":"goodbye world"}
And ran this query (there was also an error in your "highlight" block; it should say "fields" rather than "field"):
POST /test_index/doc/_search
{
"query": {
"match": {
"content": {
"query": "good wor",
"operator": "and"
}
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"fields": {
"content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
I get back this response, which seems to be what you're looking for, if I understand you correctly:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2712221,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.2712221,
"_source": {
"content": "goodbye world"
},
"highlight": {
"content": [
"<em>goodbye</em> <em>world</em>"
]
}
}
]
}
}
Here is some code I used to test it out:
http://sense.qbox.io/gist/3092992993e0328f7c4ee80e768dd508a0bc053f
In my mapping below, the url field uses a different analyzer from the title and description fields. When I search the three fields simultaneously, nothing is returned, even though each field contains one of the three words from the query below.
{
"settings": {
"index": {
"number_of_shards": "5",
"number_of_replicas": "0",
"analysis": {
"filter": {
"stemmer_plural_portugues": {
"name": "minimal_portuguese",
"stopwords" : ["http", "https", "ftp", "www"],
"type": "stemmer"
},
"synonym_filter": {
"type": "synonym",
"lenient": true,
"synonyms_path": "analysis/synonym.txt",
"updateable" : true
},
"shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"analyzer_customizado": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding",
"synonym_filter",
"shingle_filter" ],
"tokenizer": "standard"
},
"analyzer_url": {
"filter": [
"lowercase",
"stemmer_plural_portugues",
"asciifolding" ],
"tokenizer": "lowercase"
}
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long"
},
"data": {
"type": "date"
},
"quebrado": {
"type": "byte"
},
"pgrk": {
"type": "integer"
},
"url_length": {
"type": "integer"
},
"title": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"description": {
"analyzer": "analyzer_customizado",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"url": {
"analyzer": "analyzer_url",
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
}
}
}
}
In the query below, each of the three words exists in one of the fields, but I only get results if I search for the words that are in the title and description. As soon as I include the word that is in the url field (the one with the different analyzer), nothing is returned.
If I search only for the words that are in the title and description fields, they are found; if I search only for the word that is in the url field, it is found too. However, if I search for all three words across the three fields at once, nothing is returned.
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "carro moto aviao",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "and"
}
}
}
The issue is that you are using the and operator, which means all three words (carro, moto, aviao) must be present. Can you change it to or and see if it returns results?
Below is a working example with your mapping, sample data, and a search query using the or operator, confirming that it works.
Sample doc
{
"title": "carro",
"description": "moto",
"url": "aviao"
}
Search query with OR param
{
"from": 0,
"size": 10,
"query": {
"multi_match": {
"query": "carro moto aviao",
"type": "cross_fields",
"fields": [
"title",
"description",
"url"
],
"operator": "or"
}
}
}
Search result
"hits": [
{
"_index": "jean",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "carro",
"description": "moto",
"url": "aviao"
}
}
]
Note: I confirmed that it does not work with the and operator in your query.
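This is presumably because the url field uses a different analyzer, so cross_fields puts it in its own group. If you do need every word to be present somewhere across the three fields, one possible workaround (a sketch added here as an assumption, not part of the answer above) is to split the terms into separate multi_match clauses combined under bool/must, so each word may match any of the fields:
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "multi_match": { "query": "carro", "fields": [ "title", "description", "url" ] } },
        { "multi_match": { "query": "moto", "fields": [ "title", "description", "url" ] } },
        { "multi_match": { "query": "aviao", "fields": [ "title", "description", "url" ] } }
      ]
    }
  }
}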
I have implemented auto-suggest using Elasticsearch, where I give users suggestions based on the value they type into the 'where' field. Most of it works fine if I type a full word or the first few characters of a word. I want to highlight only the specific characters the user typed: for example, if the user types 'ca', the suggestion should highlight just the 'Ca' in 'California', not the whole word.
The highlight tags should produce a result like <b>Ca</b>lifornia and not <b>California</b>.
Here are my index settings:
{
"settings": {
"index": {
"analysis": {
"filter": {
"edge_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50
},
"lowercase_filter":{
"type":"lowercase",
"language": "greek"
},
"metro_synonym": {
"type": "synonym",
"synonyms_path": "metro_synonyms.txt"
},
"profession_specialty_synonym": {
"type": "synonym",
"synonyms_path": "profession_specialty_synonyms.txt"
}
},
"analyzer": {
"auto_suggest_analyzer": {
"filter": [
"lowercase",
"edge_filter"
],
"type": "custom",
"tokenizer": "whitespace"
},
"auto_suggest_search_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "whitespace"
},
"lowercase": {
"filter": [
"trim",
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"properties": {
"what_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
},
"company": {
"type": "text",
"analyzer": "lowercase"
},
"where_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
},
"tags_auto_suggest": {
"type": "text",
"analyzer": "auto_suggest_analyzer",
"search_analyzer": "auto_suggest_search_analyzer",
"fields": {
"raw":{
"type":"keyword"
}
}
}
}
}
}
The query I am using to pull suggestions:
GET /autosuggest_index_test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"where_auto_suggest": {
"query": "ca",
"operator": "and"
}
}
}
]
}
},
"aggs": {
"NAME": {
"terms": {
"field": "where_auto_suggest.raw",
"size": 10
}
}
},
"highlight": {
"pre_tags": [
"<b>"
],
"post_tags": [
"</b>"
],
"fields": {
"where_auto_suggest": {
}
}
}
}
One of the JSON results I am getting:
{
"_index" : "autosuggest_index_test",
"_type" : "_doc",
"_id" : "Calabasas CA",
"_score" : 5.755663,
"_source" : {
"where_auto_suggest" : "Calabasas CA"
},
"highlight" : {
"where_auto_suggest" : [
"<b>Calabasas</b> <b>CA</b>"
]
}
}
Can someone please suggest how to get output here (in where_auto_suggest) like "<b>Ca</b>labasas <b>CA</b>"?
I don't know the exact reason for certain, but if you use an edge_ngram tokenizer instead of an edge_ngram filter you will get highlighted characters instead of highlighted words. A likely explanation is that the edge_ngram token filter emits grams that keep the offsets of the whole original token, whereas the edge_ngram tokenizer gives each gram its own start and end offsets, and those offsets are what the highlighter uses.
So in your settings, you could declare such a tokenizer:
"settings": {
"index": {
"analysis": {
"tokenizer": {
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 50,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
...
}
}
}
And change your analyzer to:
"analyzer": {
"auto_suggest_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "edge_tokenizer"
}
...
}
Thus your example request will return:
{
...
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "autosuggest_index_test",
"_type": "_doc",
"_id": "grIzo28BY9R4-IxJhcFv",
"_score": 0.2876821,
"_source": {
"where_auto_suggest": "california"
},
"highlight": {
"where_auto_suggest": [
"<b>ca</b>lifornia"
]
}
}
]
}
...
}
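If you want to verify this behaviour yourself (assuming the index has been recreated with the edge_ngram tokenizer above), the _analyze API reports a start_offset and end_offset per token, and with the tokenizer each gram now carries its own narrow offsets, which is what the highlighter relies on:
GET /autosuggest_index_test/_analyze
{
  "analyzer": "auto_suggest_analyzer",
  "text": "california"
}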
I'm trying to achieve Google-style autocomplete and autocorrection with Elasticsearch.
Mappings:
POST music
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"song": {
"properties": {
"song_field": {
"type": "string",
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"suggest": {
"type": "completion",
"analyzer": "simple",
"search_analyzer": "simple",
"payloads": true
}
}
}
}
}
Docs:
POST music/song
{
"song_field" : "beautiful queen",
"suggest" : "beautiful queen"
}
POST music/song
{
"song_field" : "beautiful",
"suggest" : "beautiful"
}
I expect that when the user types "beaatiful q", they will get something like "beautiful queen" ("beaatiful" is corrected to "beautiful" and "q" is completed to "queen").
I've tried the following query:
POST music/song/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatiful q",
"completion": {
"field": "suggest"
}
}
},
"query": {
"match": {
"song_field": {
"query": "beaatiful q",
"fuzziness": 2
}
}
}
}
Unfortunately, the completion suggester doesn't allow any typos, so I get this response:
"suggest": {
"didYouMean": [
{
"text": "beaatiful q",
"offset": 0,
"length": 11,
"options": []
}
]
}
In addition, the search gave me these results ("beautiful" ranked higher even though the user had started to write "queen"):
"hits": [
{
"_index": "music",
"_type": "song",
"_id": "AVUj4Y5NancUpEdFLeLo",
"_score": 0.51315063,
"_source": {
"song_field": "beautiful"
"suggest": "beautiful"
}
},
{
"_index": "music",
"_type": "song",
"_id": "AVUj4XFAancUpEdFLeLn",
"_score": 0.32071912,
"_source": {
"song_field": "beautiful queen"
"suggest": "beautiful queen"
}
}
]
UPDATE !!!
I found out that I can use a fuzzy query with the completion suggester, but now I get no suggestions when querying (fuzziness only supports an edit distance of up to 2):
POST music/song/_search
{
"size": 10,
"suggest": {
"didYouMean": {
"text": "beaatefal q",
"completion": {
"field": "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}
}
I still expect "beautiful queen" as suggestion response.
When you want to provide two or more words as search suggestions, I have found out (the hard way) that it's not worth using ngrams or edge ngrams in Elasticsearch.
Using the shingle token filter and a shingle analyzer will give you multi-word phrases, and if you couple that with match_phrase_prefix it should give you the functionality you're looking for (a sketch of such a query follows the mapping below).
Basically something like this:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
And don't forget to do your mapping:
{
"my_type": {
"properties": {
"title": {
"type": "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
}
}
}
}
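For completeness, here is a hedged sketch of the match_phrase_prefix query mentioned above (the query text is illustrative, and whether you target the plain title field or the title.shingles sub-field depends on the phrase behaviour you want):
POST /my_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "sample proj",
        "max_expansions": 50
      }
    }
  }
}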
Ngrams and edge ngrams will tokenize down to single characters, whereas the shingle analyzer and filter group words into phrases and provide a much more efficient way of producing and searching for phrases. I spent a lot of time messing with the two approaches above until I saw shingles mentioned and read up on them. Much better.
I have my analyzers set like this:
"analyzer": {
"edgeNgram_autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete"]
},
"full_name": {
"filter":["standard","lowercase","asciifolding"],
"type":"custom",
"tokenizer":"standard"
}
My filter:
"filter": {
"autocomplete": {
"type": "edgeNGram",
"side":"front",
"min_gram": 1,
"max_gram": 50
}
Name field analyzer:
"textbox": {
"_parent": {
"type": "document"
},
"properties": {
"text": {
"fields": {
"text": {
"type":"string",
"analyzer":"full_name"
},
"autocomplete": {
"type": "string",
"index_analyzer": "edgeNgram_autocomplete",
"search_analyzer": "full_name",
"analyzer": "full_name"
}
},
"type":"multi_field"
}
}
}
Put all together, this makes up my mapping for the docstore index:
PUT http://localhost:9200/docstore
{
"settings": {
"analysis": {
"analyzer": {
"edgeNgram_autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete"]
},
"full_name": {
"filter":["standard","lowercase","asciifolding"],
"type":"custom",
"tokenizer":"standard"
}
},
"filter": {
"autocomplete": {
"type": "edgeNGram",
"side":"front",
"min_gram": 1,
"max_gram": 50
} }
}
},
"mappings": {
"space": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"document": {
"_parent": {
"type": "space"
},
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"textbox": {
"_parent": {
"type": "document"
},
"properties": {
"bbox": {
"type": "long"
},
"text": {
"fields": {
"text": {
"type":"string",
"analyzer":"full_name"
},
"autocomplete": {
"type": "string",
"index_analyzer": "edgeNgram_autocomplete",
"search_analyzer": "full_name",
"analyzer":"full_name"
}
},
"type":"multi_field"
}
}
},
"entity": {
"_parent": {
"type": "document"
},
"properties": {
"bbox": {
"type": "long"
},
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
Add a space to hold all docs:
POST http://localhost:9200/docstore/space
{
"name": "Space 1"
}
When the user enters the word proj, this should return all of the following text values:
SampleProject
Sample Project
Project Name
myProjectname
firstProjectName
my ProjectName
But it returns nothing.
My query:
POST http://localhost:9200/docstore/textbox/_search
{
"query": {
"match": {
"text": "proj"
}
},
"filter": {
"has_parent": {
"type": "document",
"query": {
"term": {
"name": "1-a1-1001.pdf"
}
}
}
}
}
If I search by project, I get:
{ "took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 3.0133555,
"hits": [
{
"_index": "docstore",
"_type": "textbox",
"_id": "AVRuV2d_f4y6IKuxK35g",
"_score": 3.0133555,
"_routing": "AVRuVvtLf4y6IKuxK33f",
"_parent": "AVRuV2cMf4y6IKuxK33g",
"_source": {
"bbox": [
8750,
5362,
9291,
5445
],
"text": [
"Sample Project"
]
}
},
{
"_index": "docstore",
"_type": "textbox",
"_id": "AVRuV2d_f4y6IKuxK35Y",
"_score": 2.4106843,
"_routing": "AVRuVvtLf4y6IKuxK33f",
"_parent": "AVRuV2cMf4y6IKuxK33g",
"_source": {
"bbox": [
8645,
5170,
9070,
5220
],
"text": [
"Project Name and Address"
]
}
}
]
}
}
Maybe my edgengram is not suited for this? I am setting:
"side": "front"
Should I do it differently? Does anyone know what I am doing wrong?
The problem is with the autocomplete indexing analyzer field name.
Change:
"index_analyzer": "edgeNgram_autocomplete"
To:
"analyzer": "edgeNgram_autocomplete"
And also search like @Andrei Stefan showed in his answer:
POST http://localhost:9200/docstore/textbox/_search
{
"query": {
"match": {
"text.autocomplete": "proj"
}
}
}
And it will work as expected!
I have tested your configuration on Elasticsearch 2.3.
By the way, the multi_field type is deprecated.
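The same structure can be expressed with the fields property instead; a minimal sketch using the field names from the question:
"text": {
  "type": "string",
  "analyzer": "full_name",
  "fields": {
    "autocomplete": {
      "type": "string",
      "analyzer": "edgeNgram_autocomplete",
      "search_analyzer": "full_name"
    }
  }
}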
Hope I have managed to help :)
Your query should actually try to match on text.autocomplete and not text:
"query": {
"match": {
"text.autocomplete": "proj"
}
}
I'm having a problem with an Elasticsearch query.
I want to be able to sort the results, but Elasticsearch is ignoring the sort tag. Here is my query:
{
"sort": [{
"title": {"order": "desc"}
}],
"query":{
"term": { "title": "pagos" }
}
}
However, when I remove the query part and I send only the sort tag, it works.
Can anyone point me to the correct way?
I also tried with the following query, which is the complete query that I have:
{
"sort": [{
"title": {"order": "asc"}
}],
"query":{
"bool":{
"should":[
{
"match":{
"title":{
"query":"Pagos",
"boost":9
}
}
},
{
"match":{
"description":{
"query":"Pagos",
"boost":5
}
}
},
{
"match":{
"keywords":{
"query":"Pagos",
"boost":3
}
}
},
{
"match":{
"owner":{
"query":"Pagos",
"boost":2
}
}
}
]
}
}
}
Settings
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 15,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "asciifolding"]
},
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"autocomplete_filter"
]
}
}
}
}
}
Mappings
{
"objects": {
"properties": {
"id": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string" },
"title": { "type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer": "standard" },
"owner": { "type": "string", "boost": 2 },
"description": { "type": "string", "boost": 4 },
"keywords": { "type": "string", "boost": 1 }
}
}
}
Thanks in advance!
The field "title" in your document is an analyzed string field, which is also a multivalued field, which means elasticsearch will split the contents of the field into tokens and stores it separately in the index.
You probably want to sort the "title" field alphabetically on the first term, then on the second term, and so forth, but elasticsearch doesn’t have this information at its disposal at sort time.
Hence you can change your mapping of the "title" field from:
{
"title": {
"type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer": "standard"
}
}
into a multifield mapping like this:
{
"title": {
"type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer":"standard",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Now execute your search on the analyzed "title" field and sort on the not_analyzed "title.raw" field:
{
"sort": [{
"title.raw": {"order": "desc"}
}],
"query":{
"term": { "title": "pagos" }
}
}
It is beautifully explained here: String Sorting and Multifields