Elasticsearch "word" data type: mapping is not proper

I am a newbie to Elasticsearch and have been experimenting with it for the past few days. I am not able to get the mapping right: all the data types I specified in the mapping appear to be mapped to the type "word" instead of their respective types. Here is the mapping I have created:
POST my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "lowercase_analyzer" : {
          "type" : "custom",
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "my_type" : {
      "properties" : {
        "name" : {"type" : "string", "index" : "not_analyzed"},
        "field0" : {
          "properties" : {
            "field1" : {"type" : "boolean"},
            "field2" : {"type" : "string"},
            "field3" : {"type" : "date", "format" : "yyyy-MM-dd HH:mm:ss SSSSSS"}
          }
        }
      }
    }
  }
}
To test the mapping, I am using the _analyze API as follows:
GET my_index/_analyze
{
  "field" : "name",
  "text" : "10.90.99.6"
}
which gives me the following result
{
  "tokens": [
    {
      "token": "10.90.99.6",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    }
  ]
}
I am expecting the "type" in the result to be "string", but I cannot understand why it returns the type "word". The same happens with the other fields as well when I post data of type boolean or timestamp to the nested fields using the _analyze API.
GET my_index/_analyze
{
  "field" : "field0.field1",
  "text" : "true"
}
Result
{
  "tokens": [
    {
      "token": "true",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
Where am I going wrong in my mapping?
Also, there is no such "word" data type in the Elasticsearch reference documentation.

That's not an error at all; you did it right. The concept of type in the mapping and the concept of type in the _analyze API are different.
The _analyze API simply returns all tokens that are present in the field you're analyzing, and each of those tokens is typed "word". That comes from Lucene's TypeAttribute class, and as you can see there's only one default value, "word".
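To see that the token type comes from the tokenizer rather than from the mapping, compare with the standard analyzer (a quick illustration; depending on your Elasticsearch version you may need to pass analyzer and text as query-string parameters instead of a JSON body):
GET _analyze
{
  "analyzer" : "standard",
  "text" : "true"
}
The standard tokenizer labels the token "<ALPHANUM>" rather than "word", while the keyword tokenizer behind your not_analyzed field always emits "word". Neither has anything to do with the string/boolean/date types you declared; to verify the mapping itself, use GET my_index/_mapping.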

Related

How to get an index item that has "name": "McLaren" by searching with "mclaren" in Elasticsearch 1.7?

Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
Mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
Analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
I would like to get the index item by searching for mclaren, but the name indexed is McLaren.
I would like to stick with query_string because a lot of other functionality is based on it. Here is the query with which I can't get the expected result:
{
"query": {
"filtered": {
"query": {
"query_string" : {
"query" : "mclaren",
"default_operator" : "AND",
"analyze_wildcard" : true,
}
}
}
},
"size" :50,
"from" : 0,
"sort": {}
}
How could I accomplish this? Thank you!
I got it! The problem is certainly the word_delimiter token filter.
By default it:
Splits tokens at letter case transitions. For example: PowerShot → Power, Shot
(cf. the documentation)
So macLaren generates two tokens -> [mac, Laren], while maclaren generates only one token ['maclaren'].
analyze example :
POST _analyze
{
"tokenizer": {
"pattern": """[^\p{L}\d]+""",
"type": "pattern"
},
"filter": [
"word_delimiter"
],
"text": ["macLaren", "maclaren"]
}
Response:
{
"tokens" : [
{
"token" : "mac",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "Laren",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "maclaren",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 102
}
]
}
So I think one option is to configure your word_delimiter filter with the option split_on_case_change set to false (see the parameters doc).
PS: remember to remove the settings you previously added (cf. comments), since with that setting your query_string query will only target the name field, which does not exist.
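A minimal sketch of that change, using a hypothetical filter name and omitting the stop/stemmer filters from your original analyzer for brevity:
PUT my_index
{
  "settings" : {
    "analysis" : {
      "tokenizer" : {
        "filename" : {
          "pattern" : "[^\\p{L}\\d]+",
          "type" : "pattern"
        }
      },
      "filter" : {
        "my_word_delimiter" : {
          "type" : "word_delimiter",
          "split_on_case_change" : false
        }
      },
      "analyzer" : {
        "filename_index" : {
          "tokenizer" : "filename",
          "filter" : ["my_word_delimiter", "lowercase"]
        }
      }
    }
  }
}
With split_on_case_change disabled, McLaren is kept as a single token, and the lowercase filter then stores it as mclaren, so a query_string search for mclaren matches.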

Elasticsearch mapping for UK postcodes, able to deal with spacing and capitalization

I am looking for a mapping/analyzer setup for Elasticsearch 7 with the UK postcodes. We do not require any fuzzy operator, but should be able to deal with variance in capital letters and spacing.
Some examples:
Query string: "SN13 9ED" should return:
sn139ed
SN13 9ED
Sn13 9ed
but should not return:
SN13 1EP
SN131EP
The keyword analyzer is used by default and this seems to be sensitive to spacing issues, but not to capital letters. It also will return a match for SN13 1EP unless we specify a query as SN13 AND 9ED, which we do not want.
Additionally, with the keyword analyzer, a query of SN13 9ED returns a result of SN13 1EP with a higher relevance than SN13 9ED even though this should be the exact match. Why are 2 matches in the same string a lower relevance than just 1 match?
Mapping for postal code
"post_code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
Query
"query" => array:1 [▼
"query_string" => array:1 [▼
"query" => "KT2 7AJ"
]
]
I believe, based on my comments, you may have been able to filter out SN13 1EP when your search string is SN13 9ED.
Hope you are aware of what analysis is, how analyzers work on text fields, and how the Standard Analyzer is applied by default to tokens before they are eventually stored in the inverted index. Note that this applies only to text fields.
Looking at your mapping, if you had searched on post_code rather than post_code.keyword, I believe the capitalization issue would have been resolved, because for text fields ES by default uses the Standard Analyzer, which means your tokens end up saved in the index in lowercase, and at query time the same analyzer is applied before ES searches the inverted index.
Note that by default the same analyzer as configured in the mapping is applied at index time as well as at search time on that field.
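As a quick illustration (nothing to set up, since the standard analyzer is built in):
POST _analyze
{
  "analyzer": "standard",
  "text": "SN13 9ED"
}
This produces the lowercase tokens sn13 and 9ed, which is why searching on the text field (rather than on post_code.keyword) is not sensitive to capitalization.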
For scenarios like sn131ep, what I've done is make use of the Pattern Capture Token Filter, with a regex that breaks the token into two parts of lengths 4 and 3, which are then saved in the inverted index, in this case as sn13 and 1ep. I'm also lowercasing them before they are stored in the inverted index.
Note that the scenario I'm covering assumes your postcodes have a fixed size, i.e. 7 characters. You can add more patterns if that is not the case.
Please see below for more details:
Mapping:
PUT my_postcode_index
{
"settings" : {
"analysis" : {
"filter" : {
"mypattern" : {
"type" : "pattern_capture",
"preserve_original" : true,
"patterns" : [
"(\\w{4}+)|(\\w{3}+)", <--- Note this and feel free to add more patterns
"\\s" <--- Filter based on whitespace
]
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "pattern",
"filter" : [ "mypattern", "lowercase" ] <--- Note the lowercase here
}
}
}
},
"mappings": {
"properties": {
"postcode":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this
"fields":{
"keyword":{
"type": "keyword"
}
}
}
}
}
}
Sample Documents:
POST my_postcode_index/_doc/1
{
"postcode": "SN131EP"
}
POST my_postcode_index/_doc/2
{
"postcode": "sn13 1EP"
}
POST my_postcode_index/_doc/3
{
"postcode": "sn131ep"
}
Note that these documents are semantically the same.
Request Query:
POST my_postcode_index/_search
{
"query": {
"query_string": {
"default_field": "postcode",
"query": "SN13 1EP",
"default_operator": "AND"
}
}
}
Response:
{
"took" : 24,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.6246513,
"hits" : [
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6246513,
"_source" : {
"postcode" : "SN131EP"
}
},
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.6246513,
"_source" : {
"postcode" : "sn131ep"
}
},
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.5200585,
"_source" : {
"postcode" : "sn13 1EP"
}
}
]
}
}
Notice that all three documents are returned even with queries like sn131ep and sn13 1ep.
Additional Note:
You can make use of the Analyze API to figure out what tokens are created for a particular text:
POST my_postcode_index/_analyze
{
"analyzer": "my_analyzer",
"text": "sn139ed"
}
And you can see below what tokens are stored in the inverted index:
{
"tokens" : [
{
"token" : "sn139ed",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "sn13",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "9ed",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
}
]
}
Also:
You may also want to read about the Ngram Tokenizer. I'd advise you to play around with both solutions and see what best suits your inputs.
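As a rough sketch of what an ngram-based setup could look like (hypothetical names, parameters chosen arbitrarily; tune min_gram/max_gram for your data and index size):
PUT my_ngram_postcode_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "postcode_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "postcode_ngram_analyzer": {
          "tokenizer": "postcode_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "postcode": {
        "type": "text",
        "analyzer": "postcode_ngram_analyzer"
      }
    }
  }
}
Ngrams generate many more tokens per postcode than the pattern-capture approach, which increases index size, so it is worth comparing both against your real inputs.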
Please test it and let me know if you have any queries.
In addition to Opster's answer, the following can also be used to tackle the issue from the opposite angle. Opster's answer suggests splitting the value by a known postcode pattern, which is great.
If we do not know the pattern, the following can be used:
{
"analysis": {
"filter": {
"whitespace_remove": {
"pattern": " ",
"type": "pattern_replace",
"replacement": ""
}
},
"analyzer": {
"no_space_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"tokenizer": "keyword"
}
}
}
}
{
"post_code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"analyzer": "no_space_analyzer"
}
}
This allows us to search with any kind of spacing, and with any case due to the lowercase filter.
sn13 1ep, s n 1 3 1 e p, sn131ep will all match against SN13 1EP
I think the main drawback to this option, however, is that we will no longer get any results for sn13 on its own, as we are not producing partial tokens. A wildcard such as sn13* would bring back results, however.
Is it possible to mix both of these methods together so we can have the best of both worlds?
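One way to try mixing them (a sketch only, not tested; analyzer names taken from the two answers above) is to declare both analyzers in the index settings and expose them as multi-fields on the same source value:
"post_code": {
  "type": "text",
  "analyzer": "my_analyzer",
  "fields": {
    "no_space": {
      "type": "text",
      "analyzer": "no_space_analyzer"
    },
    "keyword": {
      "type": "keyword"
    }
  }
}
The query would then need to target both sub-fields (for example via query_string's fields parameter). Keep in mind that the two sub-fields analyze the query text differently, so how spaces in the query string behave against each sub-field still needs testing.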

Elasticsearch query response influenced by _id

I created an index with the following mappings and settings:
{
"settings": {
"analysis": {
"analyzer": {
"case_insensitive_index": {
"type": "custom",
"tokenizer": "filename",
"filter": ["icu_folding", "edge_ngram"]
},
"default_search": {
"type":"standard",
"tokenizer": "filename",
"filter": [
"icu_folding"
]
}
},
"tokenizer" : {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
"filter" : {
"edge_ngram" : {
"side" : "front",
"max_gram" : 20,
"min_gram" : 3,
"type" : "edgeNGram"
}
}
}
},
"mappings": {
"metadata": {
"properties": {
"title": {
"type": "string",
"analyzer": "case_insensitive_index"
}
}
}
}
}
I have the following documents:
{"title":"P-20150531-27332_News.jpg"}
{"title":"P-20150531-27341_News.jpg"}
{"title":"P-20150531-27512_News.jpg"}
{"title":"P-20150531-27343_News.jpg"}
Creating the documents with simple numerical IDs
111
112
113
114
and querying using the query
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title" : {
"query" : "P-20150531-27332_News.jpg",
"type" : "boolean",
"fuzziness" : "AUTO"
}
}
}
}
results in the correct scoring and ordering of the documents returned:
P-20150531-27332_News.jpg -> 2.780985
P-20150531-27341_News.jpg -> 0.8262239
P-20150531-27512_News.jpg -> 0.8120311
P-20150531-27343_News.jpg -> 0.7687101
Strangely, creating the same documents with UUIDs
557eec2e3b00002c03de96bd
557eec0f3b00001b03de96b8
557eec0c3b00001b03de96b7
557eec123b00003a03de96ba
as IDs results in different scorings of the documents:
P-20150531-27341_News.jpg -> 2.646321
P-20150531-27332_News.jpg -> 2.1998127
P-20150531-27512_News.jpg -> 1.7725387
P-20150531-27343_News.jpg -> 1.2718291
Is this an intentional behaviour of Elasticsearch? If yes - how can I preserve the correct ordering regardless of the IDs used?
In the query, it looks like you should be using 'default_search' as the analyzer for the match query, unless you actually intended to use edge-ngram on the search query too.
Example :
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title" : {
"query" : "P-20150531-27332_News.jpg",
"type" : "boolean",
"fuzziness" : "AUTO",
"analyzer" : "default_search"
}
}
}
}
default_search would be the default search analyzer only if there is no explicit search_analyzer or analyzer specified in the mapping of the field.
The article here gives a good explanation of the rules by which analyzers are applied.
Also, to ensure IDF takes documents across shards into account, you could use search_type=dfs_query_then_fetch.
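For example, keeping the same request body and only changing the URL (the index name here is a placeholder for whichever index you indexed the documents into):
GET /my_index/_search?search_type=dfs_query_then_fetch
{
  "query" : {
    "match" : {
      "title" : {
        "query" : "P-20150531-27332_News.jpg",
        "analyzer" : "default_search"
      }
    }
  }
}
This makes Elasticsearch gather term statistics from all shards before scoring, so document frequencies no longer depend on which shard each document (and hence each ID) was routed to.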

elasticsearch 1.6 field norm calculation with shingle filter

I am trying to understand the fieldnorm calculation in elasticsearch (1.6) for documents indexed with a shingle analyzer - it does not seem to include shingled terms. If so, is it possible to configure the calculation to include the shingled terms? Specifically, this is the analyzer I used:
{
"index" : {
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
This is the mapping used:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer"}
}
}
}
And I posted a few documents:
{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...
When using the following query with the explain API,
{
"query": {
"match": {
"text" : "the"
}
}
}
I get the following fieldnorms (other details omitted for brevity):
"_source": {
"text": "the quick"
},
"_explanation": {
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
"_source": {
"text": "the quick brown fox jumps over the"
},
"_explanation": {
"value": 0.375,
"description": "fieldNorm(doc=0)"
}
The values seem to suggest that ES sees 2 terms for the 1st document ("the quick") and 7 terms for the 2nd document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate field norm with the shingled terms too (ie. all terms returned by the analyzer)?
You would need to customize the default similarity by disabling the discount overlap flag.
Example:
{
"index" : {
"similarity" : {
"no_overlap" : {
"type" : "default",
"discount_overlaps" : false
}
},
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
Mapping:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer", "similarity
" : "no_overlap"}
}
}
}
To expand further:
By default, overlaps, i.e. tokens with a position increment of 0, are ignored when computing the norm.
The example below shows the positions of the tokens generated by the "my_analyzer" described in the OP:
get <index_name>/_analyze?field=text&text=the quick
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "the quick",
"start_offset": 0,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "quick",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
}
]
}
According to the Lucene documentation, the length norm calculation for the default similarity is implemented as:
state.getBoost() * lengthNorm(numTerms)
where numTerms is
FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, and
FieldInvertState.getLength() - FieldInvertState.getNumOverlap() otherwise.
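To tie this back to the numbers in the question (a rough check on my part; norms are encoded into a single byte, so the stored values are rounded): the default similarity's lengthNorm is 1/sqrt(numTerms). With discount_overlaps left at true, the shingles (position-increment-0 tokens) are excluded, so "the quick" counts as 2 terms (1/sqrt(2) ≈ 0.707, stored as 0.625) and "the quick brown fox jumps over the" counts as 7 terms (1/sqrt(7) ≈ 0.378, stored as 0.375), matching the fieldNorm values in the explain output. With discount_overlaps set to false, the shingle tokens are counted as well, producing smaller norms.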

Array input in Elasticsearch Completion Suggester

I am trying to figure out why a scenario in ES doesn't seem to work for me. I have a pretty straightforward suggest mapping setup:
{
"ding" : {
"properties" : {
"name" : { "type" : "string" },
"title" : { "type" : "string" },
"test" : { "type" : "string" },
"suggest": {
"type": "completion",
"analyzer": "simple",
"payloads": true,
"max_input_length": 50
}
}
}
}
And indexed the documents as such:
{
"title": "Title",
"name": "Name",
"test": "Test",
"suggest": {
"input": [
"Koolmees 21, Breda",
"4822PP 21"
]
}
}
The completion suggest works fine on:
{
"ding" : {
"text" : "Koo",
"completion" : {
"field" : "suggest"
}
}
}
But not on:
{
"ding" : {
"text" : "482",
"completion" : {
"field" : "suggest"
}
}
}
Is it because the input starts with a numeric character? I can't seem to figure it out :S
The completion suggester uses the simple analyzer by default. If you use the Analyze API you can see it removes the numbers:
curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty=true' -d '4822PP 21'
returns
{
"tokens" : [ {
"token" : "pp",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
} ]
}
You may want to switch the auto completion field to use the Standard analyzer.
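A minimal sketch of that change (untested; note that the standard analyzer also lowercases, so query prefixes in lowercase):
{
  "ding" : {
    "properties" : {
      "name" : { "type" : "string" },
      "title" : { "type" : "string" },
      "test" : { "type" : "string" },
      "suggest": {
        "type": "completion",
        "analyzer": "standard",
        "payloads": true,
        "max_input_length": 50
      }
    }
  }
}
With the standard analyzer, the input 4822PP 21 keeps its tokens 4822pp and 21, so a suggest text of 482 has something to match against.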
