I have modified the English analyzer to use an ngram filter as follows, so that I can handle the following scenarios:
1] partial search and special-character search
2] taking advantage of the language analyzers
I used the following mapping:
{
"settings": {
"analysis": {
"analyzer": {
"english_ngram": {
"type": "custom",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_stemmer",
"ngram_filter"
],
"tokenizer": "whitespace"
}
},
"filter": {
"english_stop": {
"type": "stop"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 25
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"en": {
"type": "string",
"analyzer": "english_ngram"
}
}
}
}
}
}
}
I indexed my data as follows:
PUT http://localhost:9200/movies/movie/1
{
"title" : "$peci#l movie"
}
And queried as follows:
{
"query": {
"multi_match": {
"query": "$peci#44 m11ov",
"fields": ["title.en"],
"operator":"and",
"type": "most_fields",
"minimum_should_match": "75%"
}
}
}
In the query I am searching for the string "$peci#44 m11ov"; ideally I should not get any results for it.
Is anything wrong here?
This is a result of ngram tokenization. When you tokenize the string $peci#l movie, your analyzer produces tokens like $, $p, $pe, etc. Your query produces most of these tokens as well, so it matches, though these matches will have a lower score than a complete match. If it is critical for you to exclude these false-positive matches, you can try to set a threshold using the min_score option: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-min-score.html
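For example, a minimal sketch of your query with such a threshold (the value 1.0 is only an assumption; you would have to tune it against the scores your real data actually produces):
{
    "min_score": 1.0,
    "query": {
        "multi_match": {
            "query": "$peci#44 m11ov",
            "fields": ["title.en"],
            "operator": "and",
            "type": "most_fields",
            "minimum_should_match": "75%"
        }
    }
}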
Settings:
{
"settings": {
"analysis": {
"analyzer": {
"idx_analyzer_ngram": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding",
"edgengram_filter_1_32"
],
"tokenizer": "ngram_alltokenchar_tokenizer_1_32"
},
"ngrm_srch_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
},
"tokenizer": {
"ngram_alltokenchar_tokenizer_1_32": {
"token_chars": [
"letter",
"whitespace",
"punctuation",
"symbol",
"digit"
],
"min_gram": "1",
"type": "nGram",
"max_gram": "32"
}
}
}
}
}
Mappings:
{
"properties": {
"TITLE": {
"type": "string",
"fields": {
"untouched": {
"index": "not_analyzed",
"type": "string"
},
"ngramanalyzed": {
"search_analyzer": "ngrm_srch_analyzer",
"index_analyzer": "idx_analyzer_ngram",
"type": "string",
"term_vector": "with_positions_offsets"
}
}
}
}
}
Query:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "have some ha",
"fields": [
"TITLE.ngramanalyzed"
],
"default_operator": "and"
}
}
}
},
"highlight": {
"fields": {
"TITLE.ngramanalyzed": {}
}
}
}
I have a document indexed with TITLE have some happy meal. When I search for have some, I get proper highlights:
<em>have</em> <em>some</em> happy meal
As I type more, e.g. have some ha, the highlight results are not as expected:
<em>ha</em>ve <em>some</em> <em>ha</em>ppy meal
The word have gets only partially highlighted, as ha.
I would expect the longest matching token to be highlighted: with ngrams of min size 1 a match can be a single character, but there should also be a longer matching token of four or five characters (for example, have should be highlighted in full, along with ha).
I am not able to find any solution for this. Please suggest.
I'm having a problem with an Elasticsearch query.
I want to be able to sort the results, but Elasticsearch is ignoring the sort tag. Here is my query:
{
"sort": [{
"title": {"order": "desc"}
}],
"query":{
"term": { "title": "pagos" }
}
}
However, when I remove the query part and send only the sort tag, it works.
Can anyone point me to the correct way?
I also tried with the following query, which is the complete query that I have:
{
"sort": [{
"title": {"order": "asc"}
}],
"query":{
"bool":{
"should":[
{
"match":{
"title":{
"query":"Pagos",
"boost":9
}
}
},
{
"match":{
"description":{
"query":"Pagos",
"boost":5
}
}
},
{
"match":{
"keywords":{
"query":"Pagos",
"boost":3
}
}
},
{
"match":{
"owner":{
"query":"Pagos",
"boost":2
}
}
}
]
}
}
}
Settings
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 15,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "asciifolding"]
},
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"autocomplete_filter"
]
}
}
}
}
}
Mappings
{
"objects": {
"properties": {
"id": { "type": "string", "index": "not_analyzed" },
"type": { "type": "string" },
"title": { "type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer": "standard" },
"owner": { "type": "string", "boost": 2 },
"description": { "type": "string", "boost": 4 },
"keywords": { "type": "string", "boost": 1 }
}
}
}
Thanks in advance!
The field "title" in your document is an analyzed string field, which is also a multivalued field, which means elasticsearch will split the contents of the field into tokens and stores it separately in the index.
You probably want to sort the "title" field alphabetically on the first term, then on the second term, and so forth, but elasticsearch doesn’t have this information at its disposal at sort time.
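You can see this for yourself with the _analyze API (a sketch; my_index stands for your actual index name). With your autocomplete analyzer, a title like Pagos is indexed as a bag of ngram tokens:
GET /my_index/_analyze?analyzer=autocomplete&text=Pagos
This should return tokens such as pag, pago, pagos, ago, agos, and gos; none of these is the whole title, so there is no single value to sort on.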
Hence you can change your mapping of the "title" field from:
{
"title": {
"type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer": "standard"
}
}
into a multifield mapping like this:
{
"title": {
"type": "string", "boost": 9, "analyzer": "autocomplete", "search_analyzer":"standard",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
Now execute your search against the analyzed "title" field and sort on the not_analyzed "title.raw" field:
{
"sort": [{
"title.raw": {"order": "desc"}
}],
"query":{
"term": { "title": "pagos" }
}
}
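Applied to your complete bool query, that would look like this (a sketch, abbreviated to two of your four should clauses):
{
    "sort": [{
        "title.raw": {"order": "asc"}
    }],
    "query": {
        "bool": {
            "should": [
                { "match": { "title": { "query": "Pagos", "boost": 9 } } },
                { "match": { "description": { "query": "Pagos", "boost": 5 } } }
            ]
        }
    }
}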
It is beautifully explained here: String Sorting and Multifields
Hello all, I am facing two problems in ES.
I have a city "New York" in ES, and I want to write a term filter such that it returns a document only if the given string exactly matches "New York". What is happening instead is that the filter returns "New York" when it matches "New" or "York", but returns nothing for "New York" itself. My mapping is given below; please tell me which analyzer or tokenizer I should use in the mapping.
Here are the settings and mapping:
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": ["synonym"]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
},
"mappings": {
    "restaurant": {
        "properties": {
            "address": {
                "properties": {
                    "city": { "type": "string", "analyzer": "synonym" }
                }
            }
        }
    }
}
The second problem is that when I try a wildcard query in lowercase, for example "new*", ES returns nothing, but when I search in uppercase, for example "New*", it returns "New York". For this second case I want to write my city mapping such that ES returns the same thing whether I search in lowercase or uppercase. I have seen the ignore_case option and tried setting it inside the synonym filter, but I am still not able to search in both lowercase and uppercase:
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt",
"ignore_case": true // See here
}
I believe you didn't provide enough details but, hoping that my attempt will generate questions from you, I will post what I believe should be a step forward.
The mapping:
PUT test
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
},
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt",
"ignore_case": true
}
}
}
}
},
"mappings": {
"restaurant": {
"properties": {
"address": {
"properties": {
"city": {
"type": "string",
"analyzer": "synonym",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"raw_ignore_case": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
}
}
Test data:
POST /test/restaurant/1
{
"address": {"city":"New York"}
}
POST /test/restaurant/2
{
"address": {"city":"new york"}
}
Query for the first problem:
GET /test/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"address.city.raw": "New York"
}
}
}
}
}
Query for the second problem:
GET /test/restaurant/_search
{
"query": {
"query_string": {
"query": "address.city.raw_ignore_case:new*"
}
}
}
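To verify the case-insensitive behaviour, you can run the keyword_lowercase analyzer directly (a sketch using the _analyze API):
GET /test/_analyze?analyzer=keyword_lowercase&text=New York
This should produce the single token new york, which is why the lowercase wildcard new* now matches.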
The char_filter section of the Elasticsearch mapping documentation is kind of vague, and I'm having a lot of difficulty understanding if and how to use the mapping char filter: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
Basically, the data we are storing in the index are IDs of type String that look like this: "008392342000". I want to be able to match such IDs even when the query terms contain a hyphen or a trailing space, like this: "008392342-000 ".
How would you advise me to set up the analyzer?
Currently this is the definition of the field:
"mappings": {
"client": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
Here are the settings for the index, containing the analyzers etc.:
"settings": {
"analysis": {
"filter": {
"autocomplete_ngram": {
"max_gram": 15,
"min_gram": 1,
"type": "edge_ngram"
},
"ngram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 8
}
},
"analyzer": {
"lowercase_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_index": {
"filter": [
"lowercase",
"autocomplete_ngram"
],
"tokenizer": "keyword"
},
"ngram_index": {
"filter": [
"ngram_filter",
"lowercase"
],
"tokenizer": "keyword"
},
"autocomplete_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
},
"ngram_search": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
},
"index": {
    "number_of_shards": 6,
    "number_of_replicas": 1
}
}
You haven't provided your actual analyzers, the data that goes in, or your expectations, but based on the info you provided I would start with this:
{
"settings": {
"analysis": {
"char_filter": {
"my_mapping": {
"type": "mapping",
"mappings": [
"-=>"
]
}
},
"analyzer": {
"autocomplete_search": {
"tokenizer": "keyword",
"char_filter": [
"my_mapping"
],
"filter": [
"trim"
]
},
"autocomplete_index": {
"tokenizer": "keyword",
"filter": [
"trim"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"ucn": {
"type": "multi_field",
"fields": {
"ucn_autoc": {
"type": "string",
"index": "analyzed",
"index_analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
},
"ucn": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
The char_filter replaces - with nothing: -=>. I would also use the trim filter to get rid of any trailing or leading whitespace. I have no idea what your autocomplete_index analyzer looks like, so I just used a keyword tokenizer.
Testing the analyzer with GET /my_index/_analyze?analyzer=autocomplete_search&text= 0123-34742-000 results in:
"tokens": [
{
"token": "012334742000",
"start_offset": 0,
"end_offset": 17,
"type": "word",
"position": 1
}
]
which means it does eliminate the - and the white spaces.
And the typical query would be:
{
"query": {
"match": {
"ucn.ucn_autoc": " 0123-34742-000 "
}
}
}
I have a query that should search for lowercase terms.
Initially I just had an index_analyzer with a lowercase filter, but I wanted to also add a search_analyzer so I could do case-insensitive searches.
"analysis": {
"analyzer" : {
"DefaultAnalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
],
"char_filter": ["punctuation"]
},
"MyAnalyzer": {
"type": "custom",
"tokenizer": "first_letter",
"filter": [
"lowercase"
]
},
So I just thought to add the same analyzer as the search_analyzer in the mapping:
"index_analyzer": "DefaultAnalyzer",
"search_analyzer": "DefaultAnalyzer",
"dynamic" : false,
"_source": { "enabled": true },
"properties" : {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"store": true
},
"startletter": {
"type": "string",
"index_analyzer": "MyAnalyzer",
"search_analyzer": "MyAnalyzer",
"store": true
}
}
},
With that in place, if I manually query Elasticsearch with
curl -XGET host:9200/my-index/_analyze -d 'Test'
I see that the query term is correctly lowercased:
{
"tokens": [
{
"token": "test",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 1
}
]
}
But when executing from code:
if I use an uppercase search term, ES returns zero hits (even though we saw that the search_analyzer is applied)
if I use a lowercase search term, ES returns the right number of hits (hundreds)
I would like to get the same results regardless of case.
In the code I'm just creating a query with a term filter, like this:
{
"filter": {
"term": {
"name.startletter": "O"
}
},
"size": 10000,
"query": {
"match_all": {}
}
}
What am I doing wrong? Why am I not getting any results?
The problem is that you are using a Term Filter. A Term Filter does not analyze the text being used:
Term Filter
Filters documents that have fields that contain a term (not analyzed).
Similar to term query, except that it acts as a filter.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-filter.html
Since it does not analyze, it does not use the analyzer that you have defined.
You generally want to use Term filters and queries with fields that are not analyzed. Change your filter type to something that will analyze during the query.
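For example, a query filter wrapping a match query will run the field's search analyzer over the input (a sketch based on your field names):
{
    "filter": {
        "query": {
            "match": {
                "name.startletter": "O"
            }
        }
    },
    "size": 10000,
    "query": {
        "match_all": {}
    }
}
With the match query, the term "O" is lowercased by the search analyzer before being compared against the indexed tokens, so the case of the input no longer matters.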
I think you are using MyAnalyzer to get the start letter of the indexed value, but your analyzer doesn't work that way. I've done some tests and finally came up with a solution.
First, create the index and mapping (+ settings):
curl -XPUT "http://localhost:9200/t1" -d'
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"DefaultAnalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
},
"MyAnalyzer": {
"type": "custom",
"tokenizer": "token_letter",
"filter": [
"one_token","lowercase"
]
}
},
"tokenizer": {
"token_letter": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "1",
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"one_token": {
"type": "limit",
"max_token_count": 1
}
}
}
}
},
"mappings": {
"t2": {
"index_analyzer": "DefaultAnalyzer",
"search_analyzer": "DefaultAnalyzer",
"dynamic": false,
"_source": {
"enabled": true
},
"properties": {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"store": true
},
"startletter": {
"type": "string",
"index_analyzer": "MyAnalyzer",
"search_analyzer": "simple",
"store": true
}
}
}
}
}
}
}'
And now, index a document:
curl -XPUT "http://localhost:9200/t1/t2/1" -d'
{
"name" :"Oliver Khan"
}'
Now here is the fun part: just a query and a facet to see what is indexed.
curl -XPOST "http://localhost:9200/t1/t2/_search" -d'
{
"filter": {
"term": {
"name.startletter": "O"
}
},
"size": 10000,
"query": {
"match_all": {}
},
"facets": {
"tf": {
"terms": {
"field": "name.startletter",
"size": 10
}
}
}
}'
This gives me the analyzed text as facet output, so I can check whether the analyzer is working.
Hope this helps!!