ElasticSearch Edge NGram Preserve Numbers - elasticsearch

I'm working on creating an autocompletion API for residential addresses.
I would like to preserve the numbers, so I don't get the following problem:
Let's say the index contains a couple of documents:
{"fullAddressLine": "Kooimanweg 10 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "10", "postCode": "1442BZ", "cityName": "Purmerend"}
{"fullAddressLine": "Kooimanweg 1009 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "1009", "postCode": "1442BZ", "cityName": "Purmerend"}
{"fullAddressLine": "Kooimanweg 1011 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "1011", "postCode": "1442BZ", "cityName": "Purmerend"}
{"fullAddressLine": "Kooimanweg 1013 1442BZ Purmerend", "streetName": "Kooimanweg", houseNumber: "1013", "postCode": "1442BZ", "cityName": "Purmerend"}
These are the settings and mappings:
{
"settings": {
"analysis": {
"filter": {
"EdgeNGramFilter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 40
}
},
"analyzer": {
"EdgeNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"EdgeNGramFilter"
]
},
"keywordAnalyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"fullAddressLine": {
"type": "text",
"analyzer": "EdgeNGramAnalyzer",
"search_analyzer": "standard",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer"
}
}
}
}
}
}
And this would be the ElasticSearch query:
{
"query": {
"bool": {
"must": [{
"match": {
"fullAddressLine": {
"query": "kooiman 10",
"operator": "and"
}
}
}]
}
}
}
The result of this is:
Kooimanweg 10 1442BZ Purmerend
Kooimanweg 1009 1442BZ Purmerend
Kooimanweg 1011 1442BZ Purmerend
Kooimanweg 1013 1442BZ Purmerend
This works, but I would only like to see this:
Kooimanweg 10 1442BZ Purmerend
How can I change the query or mappings/settings to achieve this result?
When using the "EdgeNgramAnalyzer" analyzer on "Test 1009" I get:
{
"tokens" : [
{
"token" : "t",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "te",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "tes",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "test",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "1",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "10",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "100",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "1009",
"start_offset" : 5,
"end_offset" : 9,
"type" : "<NUM>",
"position" : 1
}
]
}
I want to preserve numbers so they don't get split.
Thanks to everyone in advance.
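One possible approach (a sketch, not a verified answer from this thread: the index name addresses, the filter name edge_ngram_unless_number, and the use of the condition token filter with a Painless predicate are my assumptions, including that the condition filter can reference the custom edge n-gram filter defined alongside it) is to wrap the edge n-gram filter in a condition filter that skips tokens starting with a digit:

# Sketch: the condition filter and the Painless predicate are assumptions, not from the original question
PUT addresses
{
  "settings": {
    "analysis": {
      "filter": {
        "EdgeNGramFilter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 40
        },
        "edge_ngram_unless_number": {
          "type": "condition",
          "filter": [ "EdgeNGramFilter" ],
          "script": {
            "source": "!Character.isDigit(token.getTerm().charAt(0))"
          }
        }
      },
      "analyzer": {
        "EdgeNGramAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "edge_ngram_unless_number"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fullAddressLine": {
        "type": "text",
        "analyzer": "EdgeNGramAnalyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

With this, 1009 is indexed only as the single token 1009, so the query "kooiman 10" no longer matches it while "Kooimanweg 10 1442BZ Purmerend" still does; the trade-off is that numeric prefixes (house numbers and postcodes such as 1442) no longer autocomplete.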

Related

Why can't I search email domain name when using `text` type in Elasticsearch

I have an email field in documents saved in an Elasticsearch index. I am able to search the value before the # but I can't find anything by searching the domain value.
For example, the query below gives me nothing:
GET transaction-green/_search
{
"query": {
"match": {
"email": "gmail"
}
},
"_source": {
"includes": [
"email"
]
}
}
but it returns the document if I search for test#gmail.com or just test.
The mapping for this email field is the default text type:
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
Why is the domain name ignored when searching?
It is happening due to the standard analyzer. As you are using the default analyzer, it will analyze your value as shown below.
You can use the _analyze API to check the analyzer:
POST email/_analyze
{
"analyzer": "standard",
"text": ["test#gmail.com"]
}
{
"tokens" : [
{
"token" : "test",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "gmail.com",
"start_offset" : 5,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
You can define a custom analyzer with a character filter as shown below, and your query will work:
PUT /email
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "pattern_replace",
"pattern": "\\.",
"replacement": " "
}
}
}
},
"mappings": {
"properties": {
"email":{
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Now you can analyze the value using the new analyzer, and you can see it creates 3 separate tokens for the email.
POST email/_analyze
{
"analyzer": "my_analyzer",
"text": ["test#gmail.com"]
}
{
"tokens" : [
{
"token" : "test",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "gmail",
"start_offset" : 5,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "com",
"start_offset" : 11,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
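Assuming the documents are (re)indexed into this new email index (an assumption, since the original data lives in transaction-green), the original domain-only query should now return them:

# Assumes the document has been reindexed into the new "email" index
GET email/_search
{
  "query": {
    "match": {
      "email": "gmail"
    }
  }
}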

Elasticsearch - at&t and procter&gamble cases

By default, Elasticsearch with the English analyzer breaks at&t into the tokens at and t, and then removes at as a stopword.
POST _analyze
{
"analyzer": "english",
"text": "A word AT&T Procter&Gamble"
}
As a result tokens look like:
{
"tokens" : [
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "gambl",
"start_offset" : 20,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
I want to be able to match at&t exactly, and at the same time be able to search for procter&gamble exactly as well as for e.g. only procter.
So I want to build an analyzer which creates the tokens
at&t and t for the at&t string
and
procter, gambl, procter&gamble for procter&gamble.
Is there a way to create such an analyzer? Or should I create 2 index fields - one with the regular English analyzer and the other one with English analysis except for tokenization on &?
You can tokenize on whitespace and use a word delimiter filter to create tokens for at&t:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
}
}
}
}
}
Tokens:
{
"analyzer": "whitespace_with_acronymns",
"text": "A word AT&T Procter&Gamble"
}
Result: at&t is tokenized as at,t,att, so you can search this by at,t and at&t.
{
"tokens" : [
{
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "at",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "att",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "word",
"position" : 4
},
{
"token" : "proctergamble",
"start_offset" : 12,
"end_offset" : 26,
"type" : "word",
"position" : 4
},
{
"token" : "gamble",
"start_offset" : 20,
"end_offset" : 26,
"type" : "word",
"position" : 5
}
]
}
If you want to remove the stop word "at", you can add a stop word filter along with the rest of the English-analyzer filter chain:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns",
"english_possessive_stemmer",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
}
}
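For completeness, here is a sketch of attaching the first (simpler) analyzer above to a field and querying it; the index name companies and the field name company are made up for illustration:

# Sketch: index and field names are hypothetical
PUT companies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "whitespace_with_acronymns"
      }
    }
  }
}

GET companies/_search
{
  "query": {
    "match": {
      "company": "at&t"
    }
  }
}

Since at&t expands to at, t and att on both the index and the query side, this query matches documents containing AT&T.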

how to build a backward edge n-gram tokenizer

I only see n-gram and edge n-gram, and both of them start from the first letter.
I would like to create a tokenizer which can produce the following tokens.
For example:
600140 -> 0, 40, 140, 0140, 00140, 600140
You can leverage the reverse token filter twice coupled with the edge_ngram one:
PUT reverse
{
"settings": {
"analysis": {
"analyzer": {
"reverse_edgengram": {
"tokenizer": "keyword",
"filter": [
"reverse",
"edge",
"reverse"
]
}
},
"filter": {
"edge": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"properties": {
"string_field": {
"type": "text",
"analyzer": "reverse_edgengram"
}
}
}
}
Then you can test it:
POST reverse/_analyze
{
"analyzer": "reverse_edgengram",
"text": "600140"
}
Which yields this:
{
"tokens" : [
{
"token" : "40",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "0140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "00140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
},
{
"token" : "600140",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
}
]
}
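A quick usage check (the document ID and search term are chosen arbitrarily): index a code and search for one of its suffixes. Because the same analyzer runs at search time and match defaults to the or operator, the suffix tokens line up:

# Arbitrary document ID and search term, for illustration only
PUT reverse/_doc/1?refresh
{
  "string_field": "600140"
}

GET reverse/_search
{
  "query": {
    "match": {
      "string_field": "140"
    }
  }
}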

Can't get proper result from elasticsearch based on query and document tokenization

I'm trying to implement a search system in which I need to use the edge n-gram tokenizer. The settings for creating the index are shown below. I have used the same tokenizer for both the documents and the search query.
(The documents are in the Persian language.)
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "autocomplete"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge-ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
The problem shows up when I get 0 hits (results) from searching for the term 'آلمانی', while I have a doc with the data 'آلمان خوب است'.
As you can see, the result of analyzing the term 'آلمانی' shows that it generates the token 'آلمان', so the analyzer works properly:
{
"tokens" : [
{
"token" : "آ",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "آل",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "آلم",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "آلما",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "آلمان",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "آلمانی",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
The search query shown below gets 0 hits:
GET /test/_search
{
"query": {"match": {
"title": {"query": "آلمانی" , "operator": "and"}
}}
}
However, searching for the term 'آلما' returns the doc with the data 'آلمان خوب است'.
How can I fix this problem?
Your assistance would be greatly appreciated.
I found a DevTicks post by Ricardo Heck which solved my problem.
I changed my mapping settings like this:
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search",
"fields": {
"ngram": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
}
And now I get the doc "آلمان خوب است" when searching for the term "آلمانی".
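For reference (my addition, not from the linked post), running the index analyzer over the document text shows why the original and-operator query found nothing: آلمانی is never produced as a token for 'آلمان خوب است', so a query that requires every one of its edge n-grams to match cannot succeed.

POST /test/_analyze
{
  "analyzer": "autocomplete",
  "text": ["آلمان خوب است"]
}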

With html_strip as search query analyzer, searches are still performed in HTML markup

In short: can html_strip be used with an analyzer that is used only in queries?
I have a very simple test index with the following settings:
POST test
{
"settings": {
"analysis": {
"filter": {
"synonym": {
"synonyms_path": "/usr/share/wordnet-prolog/wn_s.pl",
"ignore_case": "true",
"type": "synonym",
"format": "wordnet"
}
},
"analyzer": {
"synonym_analyzer": {
"char_filter": "html_strip",
"filter": [
"asciifolding",
"snowball",
"synonym"
],
"type": "custom",
"tokenizer": "lowercase"
}
}
}
},
"mappings": {
"child": {
"_parent": {
"type": "parent"
}
}
}
}
And some sample data:
PUT test/parent/1
{
"type": "flying stuff"
}
PUT test/child/1?parent=1
{
"name": "butterfly"
}
PUT test/child/2?parent=1
{
"name": "<strong>tire</strong>"
}
On the corresponding child/parent searches I still get results when querying for the tag name "strong", for example:
GET test/parent/_search
{
"query": {
"has_child": {
"type": "child",
"query": {
"match": {
"name": {
"query": "strong",
"analyzer": "synonym_analyzer"
}
}
}
}
}
}
GET test/child/_search
{
"query": {
"match": {
"name": {
"query": "strong",
"analyzer": "synonym_analyzer"
}
}
}
}
What is interesting is that when I test the analyzer with http://localhost:9200/test/_analyze?text=%3Cstrong%3Edemo%3C/strong%3E&analyzer=synonym_analyzer&pretty=true the data is interpreted correctly (no "strong" or related synonyms):
{
"tokens" : [ {
"token" : "demonstration",
"start_offset" : 8,
"end_offset" : 21,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "demo",
"start_offset" : 8,
"end_offset" : 21,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "show",
"start_offset" : 8,
"end_offset" : 21,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "exhibit",
"start_offset" : 8,
"end_offset" : 21,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "present",
"start_offset" : 8,
"end_offset" : 21,
"type" : "SYNONYM",
"position" : 1
}, {
"token" : "demonstrate",
"start_offset" : 8,
"end_offset" : 21,
"type" : "SYNONYM",
"position" : 1
} ]
}
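A plausible explanation (my note, not from the original thread): the name field has no explicit mapping, so at index time it is analyzed with the default standard analyzer, which indexes the tag name strong as an ordinary token. html_strip in synonym_analyzer only strips markup from the query string, and the query string "strong" contains no markup to strip. This can be checked by analyzing the stored value with the index-time analyzer:

# Diagnostic only: shows the tokens produced at index time (includes "strong" and "tire")
POST test/_analyze
{
  "analyzer": "standard",
  "text": ["<strong>tire</strong>"]
}

To keep markup out of the index, the html_strip char filter has to be part of the analyzer used at index time as well, not only at query time.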
