I am searching for a phrase in an email body. I need exact phrase matching: if I search for 'Avenue New', it should return only results that contain the phrase 'Avenue New', not 'Avenue Street', 'Park Avenue', etc.
My mapping looks like this:
{
"exchangemailssql": {
"aliases": {},
"mappings": {
"email": {
"dynamic_templates": [
{
"_default": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"doc_values": true,
"type": "keyword"
}
}
}
],
"properties": {
"attachments": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"body": {
"type": "text",
"analyzer": "keylower",
"fielddata": true
},
"count": {
"type": "short"
},
"emailId": {
"type": "long"
}
}
}
},
"settings": {
"index": {
"refresh_interval": "3s",
"number_of_shards": "1",
"provided_name": "exchangemailssql",
"creation_date": "1500527793230",
"analysis": {
"filter": {
"nGram": {
"min_gram": "4",
"side": "front",
"type": "edge_ngram",
"max_gram": "100"
}
},
"analyzer": {
"keylower": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"email": {
"filter": [
"lowercase",
"unique",
"nGram"
],
"type": "custom",
"tokenizer": "uax_url_email"
},
"full": {
"filter": [
"lowercase",
"snowball",
"nGram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "0",
"uuid": "2XTpHmwaQF65PNkCQCmcVQ",
"version": {
"created": "5040099"
}
}
}
}
}
My search query is:
{
"query": {
"match_phrase": {
"body": "Avenue New"
}
},
"highlight": {
"fields" : {
"body" : {}
}
}
}
The problem here is that you're tokenizing the full body content using the keyword tokenizer, i.e. the whole body is indexed as one big lowercase token, so you cannot search inside of it.
If you change the analyzer of your body field to standard instead of keylower, the match_phrase query will find what you need. Note that the analyzer of an existing field cannot be changed in place, so you will need to recreate the index and reindex your data.
"body": {
"type": "text",
"analyzer": "standard", <---change this
"fielddata": true
},
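To see why the tokenizer matters for phrase matching, here is a toy Python sketch. It is not Elasticsearch code: the helper names, the sample body text and the simplified word splitting are all assumptions made for illustration.

```python
import re

def keyword_lowercase(text):
    # Mimics the "keylower" analyzer: keyword tokenizer + lowercase filter,
    # so the whole field value becomes one single token.
    return [text.lower()]

def standard_like(text):
    # Rough stand-in for the standard analyzer: split into words and lowercase.
    return re.findall(r"\w+", text.lower())

def phrase_matches(doc_tokens, query_tokens):
    # A phrase matches when the query tokens occur consecutively and in order.
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

body = "Our office moved to 12 Avenue New York last week"  # made-up sample
query = "Avenue New"

for analyze in (keyword_lowercase, standard_like):
    print(analyze.__name__, phrase_matches(analyze(body), analyze(query)))
```

With the keyword-style analyzer the only indexed token is the entire body, so the phrase can only match if the query equals the whole body. With word-level tokens the two query terms are found next to each other.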
Related
I am working on a project to perform multilingual full-text search using Elasticsearch.
One field can contain a mix of words from different languages, or transliterated text. For example, English text may contain Armenian words, or Armenian text may contain Russian words.
I am now trying to configure text analysis with language analyzers.
Is my analyzer correct, and will it work at all?
PUT /example
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": ["օրինակ"]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
},
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": ["пример"]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
},
"graph_synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"graph_synonyms"
]
}
}
}
},
"mappings": {
"properties": {
"age": { "type": "integer" },
"email": { "type": "keyword" },
"name": { "type": "text", "analyzer": "rebuilt_armenian" } ,
"location": {
"type": "geo_point"
}
}}}
I handle multilingual text differently. It seems that in your case you don't know the language before indexing. In my current setup, I create a sub-field per language using "fields", and each sub-field uses the language-specific analyzer.
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": [
"օրինակ"
]
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
},
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": [
"пример"
]
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
},
"graph_synonyms": {
"type": "synonym",
"synonyms_path": "analysis/synonym.txt"
}
},
"analyzer": {
"rebuilt_armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer"
]
},
"rebuilt_russian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
}
}
}
},
"mappings": {
"properties": {
"age": {
"type": "integer"
},
"email": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"ar": {
"type": "text",
"analyzer": "rebuilt_armenian"
},
"ru": {
"type": "text",
"analyzer": "rebuilt_russian"
}
}
},
"location": {
"type": "geo_point"
}
}
}
}
Neither at indexing time nor at search time do I know what language the text is in.
As far as I understand, I then have to search the specific sub-fields; if I search just "name", for example, the standard analyzer will be used:
{
"query": {
"bool": {
"must": [
{
"query_string": {
"fields": [ "name.ar", "name.ru"],
"query": "phone"
}
}
],
"filter": [
{
"geo_distance": {
"distance": "25km",
"location": {
"lat": 40.79420000 ,
"lon": 43.84528000
}
}
}
]
}
}
}
You can try to check your analyzer with the analyzer API: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
Enter some mixed text and see if the result is what you want.
Sometimes it is also ok to just use standard analyzer and forget about eliminating language-specific stopwords or stemming.
I am looking for a way to make ES search the data with multiple analyzers: an NGram analyzer and one or more language analyzers.
A possible solution is to use multi-fields and explicitly declare which analyzer to use for each sub-field.
For example, with the following mappings:
"mappings": {
"my_entity": {
"properties": {
"my_field": {
"type": "text",
"fields": {
"ngram": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"spanish": {
"type": "string",
"analyzer": "spanish"
},
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
The problem with this approach is that I have to explicitly list every field and its analyzers in the search query.
It also does not allow searching "_all" with multiple analyzers.
Is there a way to make an "_all" query use multiple analyzers?
Something like "_all.ngram" and "_all.spanish", without using copy_to to duplicate the data?
Is it possible to combine an ngram analyzer with a Spanish (or any other language) analyzer into a single custom analyzer?
I have tested the following settings, but they did not work:
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
}
},
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": ["ejemplo"]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"fields": {
"analyzer1": {
"type": "string",
"analyzer": "ngram_analyzer"
},
"analyzer2": {
"type": "string",
"analyzer": "spanish"
},
"analyzer3": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "_all",
"text": "Hola, me llamo Juan."
}
This returns just ngram results, without any Spanish analysis,
whereas
GET /ngrams_index/_analyze
{
"field": "my_field.analyzer2",
"text": "Hola, me llamo Juan."
}
properly analyzes the search string.
Is it possible to build a custom analyzer that combines Spanish and ngram?
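There is a way to create a custom ngram+language analyzer: use the standard tokenizer, apply the Spanish filters to whole words, and put the ngram filter last in the chain: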
PUT /ngrams_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"min_gram": 3,
"max_gram": 3
},
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": [
"ejemplo"
]
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer",
"ngram_filter"
]
}
}
}
},
"mappings": {
"my_entity": {
"_all": {
"enabled": true,
"analyzer": "ngram_analyzer"
},
"properties": {
"my_field": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
GET /ngrams_index/_analyze
{
"field": "my_field",
"text": "Hola, me llamo Juan."
}
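The order of the steps is what makes the difference. In the question's settings the ngram tokenizer ran first, so the Spanish stop filter only ever saw 3-character fragments. The toy Python sketch below illustrates that; it is not Elasticsearch code, the one-word stopword set and the helper names are made up for the example, and stemming is left out to keep it short.

```python
import re

SPANISH_STOP = {"me"}  # hypothetical tiny subset of the _spanish_ stopword list

def char_ngrams(text, n=3):
    # All contiguous substrings of length n (empty list if text is shorter).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def stop(tokens):
    # Drop tokens that are exactly equal to a stopword.
    return [t for t in tokens if t not in SPANISH_STOP]

text = "Hola, me llamo Juan."

# Question's chain: n-grams over the raw text first, stop filter afterwards.
# No fragment equals "me", so the stop filter removes nothing.
ngram_first = stop([t.lower() for t in char_ngrams(text)])

# Answer's chain: split into words, filter whole words, n-gram last.
words = stop(re.findall(r"\w+", text.lower()))
ngram_last = [g for w in words for g in char_ngrams(w)]

print(ngram_first)
print(ngram_last)
```

In the first chain, fragments such as " me" and "me " survive because none of them is the stopword itself. In the second chain the word "me" is removed before any n-grams are produced.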
I am currently implementing a simple person search in Elasticsearch. I did some research and found quite a lot of content about how to implement features such as full-text search.
The problem is that some queries just don't return any results.
I have the following index template:
PUT /_template/template_hca_bp
{
"template": "test",
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
},
"search_ngram": {
"type": "custom",
"tokenizer": "lowercase",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"persons": {
"properties": {
"address": {
"properties": {
"city": {
"type": "text",
"search_analyzer": "standard",
"analyzer": "autocomplete"
},
"countryCode": {
"type": "keyword"
},
"doorNumber": {
"type": "keyword"
},
"id": {
"type": "text",
"index": "no",
"include_in_all": false
},
"stairwayNumber": {
"type": "keyword"
},
"street": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"streetNumber": {
"type": "keyword"
},
"zipCode": {
"type": "keyword"
}
}
},
"id": {
"type": "keyword",
"index": "no",
"include_in_all": false
},
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard",
"boost":2
},
"personType": {
"type": "keyword",
"index": "no",
"include_in_all": false
},
"title": {
"type": "text"
}
}
}
}
}
My query looks like the following:
POST test/_search
{
"query": {
"multi_match": {
"query": "Maria",
"type":"cross_fields",
"fields": [
"name^2", "city", "street", "streetNumber", "zipCode"
]
}
}
}
If I search for "Maria", for example, I get a result. But if I search for a zipCode (e.g. 12345), I don't get any results.
The analyze API returns the following response:
"detail": {
"custom_analyzer": false,
"analyzer": {
"name": "default",
"tokens": [
{
"token": "12345",
"start_offset": 0,
"end_offset": 5,
"type": "<NUM>",
"position": 0,
"bytes": "[31 32 33 34 35]",
"positionLength": 1
}
]
}
}
I'm not getting any hits. I have tried term and match queries, among others, but I can't get it working.
The desired document:
"id": "V2718984F3A0ADA95176424457A068F9DC93FC8BDA0898A4E8248F194AE1AF4FCE04C29F46367DDEC33721C15C2679B7BB",
"name": "Maria Smith",
"personType": "APO",
"address": {
"countryCode": "A",
"city": "Testcity",
"zipCode": "12345",
"street": "Avenue",
"streetNumber": "2"
}
I have edge_ngram configured for a field.
Suppose the indexed word is: quick
and it is analyzed as: q, qu, qui, quic, quick
When I search for quickfull, documents containing quick also show up in the results.
I want only documents containing quickfull to be returned; otherwise there should be no results.
This is my mapping:
{
"john_search": {
"aliases": {},
"mappings": {
"drugs": {
"properties": {
"chemical": {
"type": "string"
},
"cutting_allowed": {
"type": "boolean"
},
"id": {
"type": "long"
},
"is_banned": {
"type": "boolean"
},
"is_discontinued": {
"type": "boolean"
},
"manufacturer": {
"type": "string"
},
"name": {
"type": "string",
"boost": 2,
"fields": {
"exact": {
"type": "string",
"boost": 4,
"analyzer": "standard"
},
"phenotic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
},
"analyzer": "autocomplete"
},
"price": {
"type": "string",
"index": "not_analyzed"
},
"refrigerated": {
"type": "boolean"
},
"sell_freq": {
"type": "long"
},
"xtra_name": {
"type": "string"
}
}
}
},
"settings": {
"index": {
"creation_date": "1475061490060",
"analysis": {
"filter": {
"my_metaphone": {
"replace": "false",
"type": "phonetic",
"encoder": "metaphone"
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "100"
}
},
"analyzer": {
"autocomplete": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "standard"
},
"dbl_metaphone": {
"filter": "my_metaphone",
"tokenizer": "standard"
}
}
},
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "qoRll9uATpegMtrnFTsqIw",
"version": {
"created": "2040099"
}
}
},
"warmers": {}
}
}
Any help would be appreciated.
It's because your name field has "analyzer": "autocomplete", which means that the autocomplete analyzer is also applied at search time. Hence the search term quickfull is tokenized to q, qu, qui, quic, quick, quickf, quickfu, quickful and quickfull, and that matches quick as well.
To prevent this, set "search_analyzer": "standard" on the name field. This overrides the index-time analyzer at search time, so the search term is no longer split into edge n-grams.
"name": {
"type": "string",
"boost": 2,
"fields": {
"exact": {
"type": "string",
"boost": 4,
"analyzer": "standard"
},
"phenotic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
},
"analyzer": "autocomplete",
"search_analyzer": "standard" <--- add this
},
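The effect can be reproduced with a toy Python sketch. It is not Elasticsearch code and the function names are invented for illustration. It uses min_gram 3, as in the autocomplete_filter settings shown in the question, so the shortest gram here is qui rather than q.

```python
def edge_ngrams(token, min_gram=3, max_gram=100):
    # Prefixes of the token, from min_gram characters up to the full token.
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# Index time: the word "quick" is stored as its edge n-grams.
indexed = set(edge_ngrams("quick"))

def matches(search_tokens):
    # A match-style query hits if any search token equals an indexed token.
    return any(t in indexed for t in search_tokens)

# Search analyzed with the same autocomplete analyzer:
# the grams of "quickfull" include "quick", so the document matches.
search_with_autocomplete = edge_ngrams("quickfull")

# Search analyzed with a standard-like analyzer: the term stays whole.
search_with_standard = ["quickfull"]

print(matches(search_with_autocomplete))
print(matches(search_with_standard))
```

Only the index side needs the edge n-grams. Keeping the search term whole means it has to equal one of the indexed prefixes to match.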
The settings for one of my indexes are as follows; however, the stemmer isn't being applied. For example, a search for fox will not pick up articles that include the term foxes. I can't see why, as the order of the filters is correct (lowercase precedes the stemmer).
{
"articles": {
"settings": {
"index": {
"creation_date": "1436255268907",
"analysis": {
"filter": {
"filter_stemmer": {
"type": "stemmer",
"language": "english"
},
"kill_filters": {
"pattern": ".*_.*",
"type": "pattern_replace",
"replacement": ""
},
"filter_stop": {
"type": "stop"
},
"filter_shingle": {
"min_shingle_size": "2",
"max_shingle_size": "5",
"type": "shingle",
"output_unigrams": "true"
},
"filter_stemmerposs": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"tags_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"filter_stemmerposs",
"filter_stemmer"
],
"tokenizer": "patterntoke"
},
"shingles_analyzer": {
"filter": [
"standard",
"lowercase",
"filter_stop",
"filter_shingle",
"kill_filters",
"filter_stemmerposs",
"filter_stemmer"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "standard"
}
},
"tokenizer": {
"patterntoke": {
"type": "pattern",
"pattern": ","
}
}
},
"number_of_shards": "5",
"number_of_replicas": "1",
"version": {
"created": "1060099"
},
"uuid": "H2NsE3eKT1y_ArPOPbjT6w"
}
}
}
}
And below is the mapping:
{
"articles": {
"mappings": {
"article": {
"properties": {
"accountid": {
"type": "double",
"include_in_all": false
},
"article": {
"type": "string",
"index_analyzer": "shingles_analyzer"
},
"articleid": {
"type": "double",
"include_in_all": false
},
"categoryid": {
"type": "double",
"include_in_all": false
},
"draftflag": {
"type": "double",
"include_in_all": false
},
"files": {
"type": "string",
"index_analyzer": "tags_analyzer"
},
"tags": {
"type": "string",
"index_analyzer": "tags_analyzer"
},
"title": {
"type": "string",
"index_analyzer": "shingles_analyzer"
},
"topicid": {
"type": "double",
"include_in_all": false
}
}
}
}
}
}
The sample documents vary, but for example one contains the token fox and another contains foxes (both derived from the article field). Each document is only found when the search term is exactly its own form (fox or foxes), not by either term, which is what I'd expect. The search used is a fuzzylikethis search (I'm using NEST for .NET to execute the query).