Elasticsearch search for words having '#' character - elasticsearch

For example, I am right now searching like this:
http://localhost:9200/posts/post/_search?q=content:%23sachin
But, I am getting all the results with 'sachin' and not '#sachin'. Also, I am writing a regular expression for getting the count of terms. The facet looks like this:
"facets": {
"content": {
"terms": {
"field": "content",
"size": 1000,
"all_terms": false,
"regex": "#sachin",
"regex_flags": [
"DOTALL",
"CASE_INSENSITIVE"
]
}
}
}
This is not returning any values. I think it has something to do with escaping the '#' inside the regular expression, but I am not sure how to do it. I have tried escaping it with \ and \\, but neither worked. Can anyone help me with this?

This article explains how to preserve # and @ using custom analyzers:
https://web.archive.org/web/20160304014858/http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html
curl -XPUT 'http://localhost:9200/twitter' -d '{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"tweet_filter" : {
"type" : "word_delimiter",
"type_table": ["# => ALPHA", "@ => ALPHA"]
}
},
"analyzer" : {
"tweet_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "tweet_filter"]
}
}
}
},
"mappings" : {
"tweet" : {
"properties" : {
"msg" : {
"type" : "string",
"analyzer" : "tweet_analyzer"
}
}
}
}
}'
This doesn't deal with facets, but redefining the type of those special characters in the analyzer could help.
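To see why the choice of tokenizer matters here, the following Python sketch contrasts two rough approximations: splitting on whitespace (roughly what the whitespace tokenizer in tweet_analyzer does before its filters run) and keeping only runs of word characters (a simplification of an analyzer that discards punctuation). It does not call Elasticsearch, and the function names are made up for illustration.

```python
import re
from urllib.parse import quote

def whitespace_tokens(text):
    """Approximate a whitespace tokenizer followed by a lowercase filter."""
    return text.lower().split()

def word_tokens(text):
    """Approximate an analyzer that drops punctuation such as '#'."""
    return re.findall(r"\w+", text.lower())

tweet = "Watching #Sachin bat today"
print(whitespace_tokens(tweet))  # '#sachin' survives as one token
print(word_tokens(tweet))        # only 'sachin' is left, the '#' is gone
print(quote("#sachin"))          # URL-encoded form used in the ?q= request
```

If the '#' is already gone from the indexed tokens, no amount of escaping in the query or regex can bring it back, which is presumably why the facet returned nothing.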

Another approach worth considering is to index a special (e.g. "reserved") word instead of the hash symbol, for example HASHSYMBOLCHAR. Make sure that you replace '#' characters in the query as well.
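A minimal sketch of that substitution in Python, assuming the same helper is applied to documents before indexing and to user input before querying. The helper names are hypothetical; only the HASHSYMBOLCHAR placeholder comes from the suggestion above.

```python
HASH_PLACEHOLDER = "HASHSYMBOLCHAR"

def encode_hash(text):
    """Replace every '#' with the reserved word so analysis cannot strip it."""
    return text.replace("#", HASH_PLACEHOLDER)

def decode_hash(text):
    """Restore '#' when showing stored text back to the user."""
    return text.replace(HASH_PLACEHOLDER, "#")

document_text = encode_hash("Century for #sachin")  # store this form
query_text = encode_hash("#sachin")                 # search with this form
print(document_text)
print(query_text)
```

The placeholder has to be a string that never occurs in real content, otherwise decode_hash would turn ordinary text into '#'.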

Related

How do I configure elastic search to use the icu_tokenizer?

I'm trying to search a text indexed by elasticsearch and the icu_tokenizer but can't get it working.
My testcase is to tokenize the sentence “Hello. I am from Bangkok”, in thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized to the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ. (Sample from Elasticsearch - The Definitive Guide)
Searching using any of the last four words fails for me. Searching using any of the space separated words สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
"icu" : {
"settings" : {
"index" : {
"creation_date" : "1474010824865",
"analysis" : {
"analyzer" : {
"nfkc_cf_normalized" : [ "icu_normalizer" ],
"tokenizer" : "icu_tokenizer"
}
}
},
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "tALRehqIRA6FGPu8iptzww",
"version" : {
"created" : "2040099"
}
}
}
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
"first_name" : "John",
"last_name" : "Doe",
"about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
"query" : {
"match" : {
"about" : "กรุงเทพฯ"
}
}
}'
Returns nothing ("hits" : [ ]).
Performing the same search with one of สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index, how should it be done?
The missing part is:
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
In the mapping, the document field must specify which analyzer to use.
[Index] : icu
[type] : employee
[field] : about
PUT /icu
{
"settings": {
"analysis": {
"analyzer": {
"icu_analyzer" : {
"char_filter": [
"icu_normalizer"
],
"tokenizer" : "icu_tokenizer"
}
}
}
},
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
}
Test the custom analyzer using the following JSON DSL request:
POST /icu/_analyze
{
"text": "สวัสดี ผมมาจากกรุงเทพฯ",
"analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ]
My suggestion: the Kibana Dev Tools console can help you craft queries effectively.
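Since the root cause was an analyzer that no field actually used, a small sanity check on the index body before sending it can catch the mismatch early. The Python sketch below is only an illustration: undefined_analyzers and referenced_analyzers are hypothetical helpers, not part of any Elasticsearch client, and built-in analyzer names such as "standard" would also be reported because they are not listed under settings.

```python
def referenced_analyzers(node):
    """Collect every string stored under an "analyzer" key in a mapping tree."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "analyzer" and isinstance(value, str):
                found.add(value)
            else:
                found |= referenced_analyzers(value)
    return found

def undefined_analyzers(index_body):
    """Analyzers named in the mappings but not defined in the settings."""
    analysis = index_body.get("settings", {}).get("analysis", {})
    defined = set(analysis.get("analyzer", {}))
    return referenced_analyzers(index_body.get("mappings", {})) - defined

index_body = {
    "settings": {"analysis": {"analyzer": {"icu_analyzer": {
        "char_filter": ["icu_normalizer"],
        "tokenizer": "icu_tokenizer",
    }}}},
    "mappings": {"employee": {"properties": {
        "about": {"type": "text", "analyzer": "icu_analyzer"},
    }}},
}
print(undefined_analyzers(index_body))
```

An empty result means every analyzer named in the mapping is defined in the settings; it says nothing about whether the ICU plugin itself is installed.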

Elasticsearch: simple query string with latin characters

I'm using a simple query string over the following text:
Jiboia de três metros é capturada em avenida de Governador
Note: this is the content of my message field.
My query string (no results)
"simple_query_string":{
"query":"tr\u00eas",
"fields":["message","author.name","author.id"],
"default_operator":"AND"
}
My query string (1 result)
"simple_query_string":{
"query":"Jiboia",
"fields":["message","author.name","author.id"],
"default_operator":"AND"
}
Is there a trick for Latin characters?
My mapping:
{"mentions-2016.02.26":{"aliases":{"mentions_ro":{},"mentions_rw":{}},"mappings":{"mention":{"dynamic_templates":[{"analyzer":{"mapping":{"type":"string","index":"not_analyzed","store":"no"},"match":"*","match_mapping_type":"string"}}],"date_detection":false,"properties":{"analytics":{"properties":{"collect_delay":{"type":"long"},"number_of_replies":{"type":"long"},"twitter_reach":{"type":"long"},"youtube_views":{"type":"long"}}},"author":{"properties":{"gender":{"type":"string","index":"not_analyzed"},"id":{"type":"string"},"locale":{"properties":{"area":{"type":"string","index":"not_analyzed"},"country":{"type":"string","index":"not_analyzed"}}},"name":{"type":"string"},"platform_id":{"type":"long"}}},"created_at":{"type":"date","format":"dateOptionalTime"},"elastic_date":{"type":"date","format":"dateOptionalTime"},"id":{"type":"long"},"items_batch_created":{"type":"date","format":"dateOptionalTime"},"message":{"type":"string"},"metadata":{"properties":{"event":{"type":"string","index":"not_analyzed"},"timestamp":{"type":"long"}}},"monitoring":{"properties":{"id":{"type":"long"},"owner":{"properties":{"email":{"type":"string","index":"not_analyzed"},"id":{"type":"long"},"plan":{"properties":{"active":{"type":"string","index":"not_analyzed"},"name":{"type":"string","index":"not_analyzed"},"paid":{"type":"string","index":"not_analyzed"}}}}}}},"parent_id":{"type":"long"},"published_at":{"type":"date","format":"dateOptionalTime"},"raw_content":{"properties":{"actor_link":{"type":"string","index":"not_analyzed"},"aid":{"type":"string","index":"not_analyzed"},"atom_content":{"type":"string","index":"not_analyzed"},"attachment_content":{"type":"string","index":"not_analyzed"},"attachment_image":{"type":"string","index":"not_analyzed"},"attachment_text":{"type":"string","index":"not_analyzed"},"attachment_url":{"type":"string","index":"not_analyzed"},"attribution":{"type":"string","index":"not_analyzed"},"author":{"type":"string","index":"not_analyzed"},"author_name":{"ty
pe":"string","index":"not_analyzed"},"author_uri":{"type":"string","index":"not_analyzed"},"can_comment":{"type":"string","index":"not_analyzed"},"caption":{"type":"string","index":"not_analyzed"},"cast":{"type":"string","index":"not_analyzed"},"category":{"type":"string","index":"not_analyzed"},"channellink":{"type":"string","index":"not_analyzed"},"channeltitle":{"type":"string","index":"not_analyzed"},"comment_id":{"type":"string","index":"not_analyzed"},"comment_info":{"type":"string","index":"not_analyzed"},"comment_real_id":{"type":"string","index":"not_analyzed"},"comments":{"type":"string","index":"not_analyzed"},"content":{"type":"string","index":"not_analyzed"},"created_at":{"type":"string","index":"not_analyzed"},"created_time":{"type":"string","index":"not_analyzed"},"createdat":{"type":"long"},"date_timestamp":{"type":"string","index":"not_analyzed"},"dateuploaded":{"type":"string","index":"not_analyzed"},"description":{"type":"string","index":"not_analyzed"},"displayName":{"type":"string","index":"not_analyzed"},"download":{"type":"string","index":"not_analyzed"},"downloadurl":{"type":"string","index":"not_analyzed"},"duration":{"type":"string","index":"not_analyzed"},"embed":{"type":"string","index":"not_analyzed"},"embed_privacy":{"type":"string","index":"not_analyzed"},"farm":{"type":"long"},"firstname":{"type":"string","index":"not_analyzed"},"flickrid":{"type":"string","index":"not_analyzed"},"fonte_id":{"type":"string","index":"not_analyzed"},"format":{"type":"string","index":"not_analyzed"},"fotoPai":{"type":"string","index":"not_analyzed"},"from_id":{"type":"string","index":"not_analyzed"},"from_name":{"type":"string","index":"not_analyzed"},"from_user":{"type":"string","index":"not_analyzed"},"from_user_id":{"type":"string","index":"not_analyzed"},"from_user_profile_image_url":{"type":"string","index":"not_analyzed"},"gdcomments":{"type":"string","index":"not_analyzed"},"gender":{"type":"string","index":"not_analyzed"},"guid":{"type":"string",
"index":"not_analyzed"},"height":{"type":"string","index":"not_analyzed"},"icon":{"type":"string","index":"not_analyzed"},"idComment":{"type":"string","index":"not_analyzed"},"idVideo":{"type":"string","index":"not_analyzed"},"id_externo":{"type":"string","index":"not_analyzed"},"idexterno":{"type":"string","index":"not_analyzed"},"image":{"type":"string","index":"not_analyzed"},"imagem":{"type":"string","index":"not_analyzed"},"impactoyoutube":{"type":"string","index":"not_analyzed"},"inReplyTo":{"properties":{"id":{"type":"string","index":"not_analyzed"},"url":{"type":"string","index":"not_analyzed"}}},"in_reply_to_screen_name":{"type":"string","index":"not_analyzed"},"in_reply_to_status_id":{"type":"long"},"incontest":{"type":"string","index":"not_analyzed"},"isPicture":{"type":"boolean"},"is_hd":{"type":"string","index":"not_analyzed"},"is_private":{"type":"string","index":"not_analyzed"},"is_transcoding":{"type":"string","index":"not_analyzed"},"iso_language_code":{"type":"string","index":"not_analyzed"},"klout":{"type":"long"},"language":{"type":"string","index":"not_analyzed"},"like_info":{"type":"string","index":"not_analyzed"},"likes":{"type":"string","index":"not_analyzed"},"link":{"type":"string","index":"not_analyzed"},"link_related":{"type":"string","index":"not_analyzed"},"link_self":{"type":"string","index":"not_analyzed"},"location":{"type":"string","index":"not_analyzed"},"mediacategory":{"type":"string","index":"not_analyzed"},"mediacontent":{"type":"string","index":"not_analyzed"},"mediadescription":{"type":"string","index":"not_analyzed"},"mediakeywords":{"type":"string","index":"not_analyzed"},"mediaplayer":{"type":"string","index":"not_analyzed"},"mediathumbnail":{"type":"string","index":"not_analyzed"},"mediatitle":{"type":"string","index":"not_analyzed"},"message":{"type":"string","index":"not_analyzed"},"modified_date":{"type":"string","index":"not_analyzed"},"monitoramento_id":{"type":"string","index":"not_analyzed"},"name":{"type":"string"
,"index":"not_analyzed"},"note_count":{"type":"long"},"number_of_comments":{"type":"string","index":"not_analyzed"},"number_of_likes":{"type":"string","index":"not_analyzed"},"number_of_plays":{"type":"string","index":"not_analyzed"},"owner":{"type":"string","index":"not_analyzed"},"parent_id":{"type":"string","index":"not_analyzed"},"permalink":{"type":"string","index":"not_analyzed"},"photo":{"properties":{"default":{"type":"boolean"},"prefix":{"type":"string","index":"not_analyzed"},"suffix":{"type":"string","index":"not_analyzed"}}},"picture":{"type":"string","index":"not_analyzed"},"post_id":{"type":"string","index":"not_analyzed"},"privacy":{"type":"string","index":"not_analyzed"},"profile_image_url":{"type":"string","index":"not_analyzed"},"profile_picture":{"type":"string","index":"not_analyzed"},"pubdate":{"type":"string","index":"not_analyzed"},"publicado":{"type":"string","index":"not_analyzed"},"published":{"type":"string","index":"not_analyzed"},"realname":{"type":"string","index":"not_analyzed"},"removido":{"type":"string","index":"not_analyzed"},"retroactive":{"type":"boolean"},"secret":{"type":"string","index":"not_analyzed"},"secretkey":{"type":"string","index":"not_analyzed"},"server":{"type":"string","index":"not_analyzed"},"share_id":{"type":"string","index":"not_analyzed"},"slide_id":{"type":"string","index":"not_analyzed"},"slideshowembedurl":{"type":"string","index":"not_analyzed"},"slideshowtype":{"type":"string","index":"not_analyzed"},"source":{"type":"string","index":"not_analyzed"},"src_big":{"type":"string","index":"not_analyzed"},"status":{"type":"string","index":"not_analyzed"},"summary":{"type":"string","index":"not_analyzed"},"t_id":{"type":"string","index":"not_analyzed"},"tags":{"type":"string","index":"not_analyzed"},"text":{"type":"string","index":"not_analyzed"},"texto":{"type":"string","index":"not_analyzed"},"thumbnail":{"type":"string","index":"not_analyzed"},"thumbnails":{"type":"string","index":"not_analyzed"},"thumbnailsiz
e":{"type":"string","index":"not_analyzed"},"thumbnailsmallurl":{"type":"string","index":"not_analyzed"},"thumbnailurl":{"type":"string","index":"not_analyzed"},"thumbnailxlargeurl":{"type":"string","index":"not_analyzed"},"thumbnailxxlargeurl":{"type":"string","index":"not_analyzed"},"tip_id":{"type":"string","index":"not_analyzed"},"title":{"type":"string","index":"not_analyzed"},"to":{"type":"string","index":"not_analyzed"},"to_user":{"type":"string","index":"not_analyzed"},"to_user_id":{"type":"long"},"tumblr_id":{"type":"string","index":"not_analyzed"},"tweet_id":{"type":"string","index":"not_analyzed"},"type":{"type":"string","index":"not_analyzed"},"update_id":{"type":"string","index":"not_analyzed"},"updated":{"type":"string","index":"not_analyzed"},"updatedVideo":{"type":"string","index":"not_analyzed"},"updated_time":{"type":"string","index":"not_analyzed"},"upload_date":{"type":"string","index":"not_analyzed"},"url":{"type":"string","index":"not_analyzed"},"urls":{"type":"string","index":"not_analyzed"},"user_id":{"type":"string","index":"not_analyzed"},"user_url":{"type":"string","index":"not_analyzed"},"userimageurl":{"type":"string","index":"not_analyzed"},"username":{"type":"string","index":"not_analyzed"},"userurl":{"type":"string","index":"not_analyzed"},"veioDoAlbum":{"type":"boolean"},"video":{"type":"long"},"vimeo_id":{"type":"string","index":"not_analyzed"},"wall_id":{"type":"string","index":"not_analyzed"},"width":{"type":"string","index":"not_analyzed"},"ytduration":{"type":"string","index":"not_analyzed"}}},"raw_content_hash":{"type":"string","index":"not_analyzed"},"search":{"properties":{"id":{"type":"long"},"social_network":{"type":"string","index":"not_analyzed"},"type":{"type":"string","index":"not_analyzed"},"type_id":{"type":"string","index":"not_analyzed"}}},"sentiment":{"properties":{"automatic":{"properties":{"active":{"type":"long"},"precision":{"type":"long"},"value":{"type":"string","index":"not_analyzed"}}},"value":{"type":"stri
ng","index":"not_analyzed"}}},"tag":{"properties":{"count":{"type":"long"},"ids":{"type":"string","index":"not_analyzed"}}},"title":{"type":"string","index":"not_analyzed"},"type":{"type":"string","index":"not_analyzed"},"updated_at":{"type":"date","format":"dateOptionalTime"},"words":{"type":"string","index":"not_analyzed"}}}},"settings":{"index":{"refresh_interval":"2s","number_of_shards":"7","gc_deletes":"1814400","creation_date":"1456497520658","number_of_replicas":"2","version":{"created":"1050299"},"uuid":"sp4CJpxMRf-_z0bUtHTrjA"}},"warmers":{}}}
You need to let Elasticsearch know how to handle your characters.
I made an example using a custom analyzer like this:
curl -XPOST "http://192.168.99.100:9200/my_type/my_type/my_type" -d'
{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"custom_filter" : {
"type" : "word_delimiter",
"type_table": ["ê => ALPHA", "Ê => ALPHA"]
}
},
"analyzer" : {
"custom_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["lowercase", "custom_filter"]
}
}
}
},
"mappings" : {
"my_type" : {
"properties" : {
"msg" : {
"type" : "string",
"analyzer" : "custom_analyzer"
}
}
}
}
}'
I just created an analyzer whose filter knows that ê and Ê need to be interpreted as letters.
After that, I just run a search on my msg field:
curl -XPOST "http://192.168.99.100:9200/my_type/my_type/my_type/my_type" -d'
{
"msg":"três"
}'
And it will work :D
I found the problem.
I was using JavaScript's atob function to decode the message when indexing it into Elasticsearch.
The atob function does not work well with my Latin characters and breaks them.
I replaced atob with the native Buffer class in Node.js.
Note: the default analyzer works perfectly with Latin characters!
Sorry!
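The effect described here can be reproduced outside JavaScript. In the sketch below, Python stands in for the Node.js code: decoding the base64 payload one byte per character (Latin-1) mimics the byte-string result that atob produces, while decoding the same bytes as UTF-8 mimics what a Buffer-based decode would return. The variable names are illustrative only.

```python
import base64

original = "três"
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")

# One character per byte: the two UTF-8 bytes of 'ê' become two characters.
broken = base64.b64decode(encoded).decode("latin-1")

# Interpreting the bytes as UTF-8 restores the original text.
fixed = base64.b64decode(encoded).decode("utf-8")

print(repr(broken))
print(repr(fixed))
```

A document indexed with the broken form contains no token três, so the simple_query_string search cannot match it regardless of which analyzer is configured.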

Indexing a comma-separated value field in Elastic Search

I'm using Nutch to crawl a site and index it into Elastic search. My site has meta-tags, some of them containing comma-separated list of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create a custom analyzer that splits the indexed text into tokens on commas.
Then you can try to search. If you don't care about relevance, you can use a filter to search through your documents. My example shows how to search with a term filter.
Below you can find how to do this with the Sense plugin.
DELETE testindex
PUT testindex
{
"index" : {
"analysis" : {
"tokenizer" : {
"comma" : {
"type" : "pattern",
"pattern" : ","
}
},
"analyzer" : {
"comma" : {
"type" : "custom",
"tokenizer" : "comma"
}
}
}
}
}
PUT /testindex/_mapping/yourtype
{
"properties" : {
"contentType" : {
"type" : "string",
"analyzer" : "comma"
}
}
}
PUT /testindex/yourtype/1
{
"contentType" : "1,2,3"
}
PUT /testindex/yourtype/2
{
"contentType" : "3,4"
}
PUT /testindex/yourtype/3
{
"contentType" : "1,6"
}
GET /testindex/_search
{
"query": {"match_all": {}}
}
GET /testindex/_search
{
"filter": {
"term": {
"contentType": "6"
}
}
}
Hope it helps.
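The idea behind the comma analyzer can be sketched in a few lines of Python. This is only an approximation of what a pattern tokenizer with pattern "," plus a term filter would do, not Elasticsearch itself, and the function names are made up for the example.

```python
def comma_tokens(value):
    """Split a comma-separated field value into individual tokens."""
    return [token for token in value.split(",") if token]

def term_matches(value, term):
    """A term filter hits when the term equals one whole token."""
    return term in comma_tokens(value)

print(comma_tokens("2,5,15"))
print(term_matches("2,5,15", "5"))
print(term_matches("25,15", "5"))
```

Because matching happens on whole tokens, searching for 5 finds "2,5,15" but not "25,15", which is the multi-valued behaviour the question asks for.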
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
"\n",
","
]
},
"text": "QUICK,brown, fox"
}

Elastic exact match w/o changing indexing

I have following query to elastic:
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"entities.hashtags": "gf"
}
}
]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
},
entities.hashtags is array and as a result I receive entries with hashtags gf_anime, gf_whatever, gf_foobar etc.
But what I need is receive entries where exact "gf" hashtag exists.
I've looked in other questions on SO and saw that the solution in this case is to change analyzing of entities.hashtags so it'll match only exact values (I am pretty new with elastic hence can mistake with terms here).
My question is whether it's possible to define exact match search INSIDE THE QUERY? Id est w/o changing how elastic indexes its fields?
Are you sure that you need to do anything? Given your examples, you don't, and you probably don't want to use not_analyzed:
curl -XPUT localhost:9200/test -d '{
"mappings": {
"test" : {
"properties": {
"body" : { "type" : "string" },
"entities" : {
"type" : "object",
"properties": {
"hashtags" : {
"type" : "string"
}
}
}
}
}
}
}'
curl -XPUT localhost:9200/test/test/1 -d '{
"body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'
curl -XPUT localhost:9200/test/test/2 -d '{
"body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'
curl -XPUT localhost:9200/test/test/3 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'
With the above data indexed, your query only returns document 2 (note: this is a simplified version of your query without the unnecessary/undesirable and filter; at least for the time being, you should always use the bool filter rather than and/or as it understands how to use the filter caches):
curl -XGET localhost:9200/test/_search -d '{
"query": {
"filtered": {
"filter": {
"term": {
"entities.hashtags": "gf"
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
}
}'
Where this breaks down is when you start putting in hashtag values that will be split into multiple tokens, thereby triggering false hits with the term filter. You can determine how the field's analyzer will treat any value by passing it to the _analyze endpoint and telling it the field to use the analyzer from:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
"tokens" : [ {
"token" : "gf_anime",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
"tokens" : [ {
"token" : "gf",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "anime",
"start_offset" : 3,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
If you were to add a fourth document with the "gf anime" variant, then you will get a false hit.
curl -XPUT localhost:9200/test/test/4 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'
This is really not an indexing problem, but a bad data problem.
With all of the explanation out of the way, you can inefficiently solve this by using a script that always follows the term filter (to efficiently rule out the more common cases that don't hit it):
curl -XGET localhost:9200/test/_search -d '{
"query": {
"filtered": {
"filter": {
"bool" : {
"must" : [{
"term" : {
"entities.hashtags" : "gf"
}
},
{
"script" : {
"script" :
"_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
"params" : {
"tag" : "gf"
}
}
}]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
}
}'
This works by parsing the original _source (and not using the indexed doc values). That is why this is not going to be very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 would not be an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.
Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, and it must be explicitly enabled using script.default_lang : groovy.
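The two-stage idea (a cheap term filter first, then an exact check against the raw source) can be illustrated with a small Python simulation. Everything here is an assumption made for the sketch: analyzed_tokens is a crude stand-in for the field's analyzer, the helper names are hypothetical, and the sample documents mirror documents 1, 2 and 4 above except that document 2 stores "gf" in lowercase, because the raw comparison is case-sensitive.

```python
import re

def analyzed_tokens(value):
    """Crude stand-in for the analyzer: lowercase, keep word-character runs."""
    return re.findall(r"\w+", value.lower())

def as_list(hashtags):
    """hashtags may be a single string or a list of strings."""
    return [hashtags] if isinstance(hashtags, str) else list(hashtags)

def term_filter(doc, tag):
    """Hit when any indexed token equals the tag (can give false hits)."""
    return any(tag in analyzed_tokens(v) for v in as_list(doc["entities"]["hashtags"]))

def source_check(doc, tag):
    """Hit only when a raw hashtag value equals the tag exactly."""
    return tag in as_list(doc["entities"]["hashtags"])

docs = {
    1: {"entities": {"hashtags": "gf_anime"}},
    2: {"entities": {"hashtags": ["gf", "gf_anime"]}},
    4: {"entities": {"hashtags": ["gf whatever", "gf anime"]}},
}
term_hits = [i for i, d in docs.items() if term_filter(d, "gf")]
exact_hits = [i for i in term_hits if source_check(docs[i], "gf")]
print(term_hits, exact_hits)
```

Running the exact check only on documents that already passed the term filter keeps the expensive step small, which is the point of ordering the two clauses that way.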

Index fields with hyphens in Elasticsearch

I'm trying to work out how to configure elasticsearch so that I can make query string searches with wildcards on fields that include hyphens.
I have documents that look like this:
{
"tags":[
"deck-clothing-blue",
"crew-clothing",
"medium"
],
"name":"Crew t-shirt navy large",
"description":"This is a t-shirt",
"images":[
{
"id":"ba4a024c96aa6846f289486dfd0223b1",
"type":"Image"
},
{
"id":"ba4a024c96aa6846f289486dfd022503",
"type":"Image"
}
],
"type":"InventoryType",
"header":{
}
}
I have tried to use a word_delimiter filter and a whitespace tokenizer:
{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 1
},
"analysis" : {
"filter" : {
"tags_filter" : {
"type" : "word_delimiter",
"type_table": ["- => ALPHA"]
}
},
"analyzer" : {
"tags_analyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : ["tags_filter"]
}
}
}
},
"mappings" : {
"yacht1" : {
"properties" : {
"tags" : {
"type" : "string",
"analyzer" : "tags_analyzer"
}
}
}
}
}
But these are the searches (for tags) and their results:
deck* -> match
deck-* -> no match
deck-clo* -> no match
Can anyone see where I'm going wrong?
Thanks :)
The analyzer is fine (though I'd lose the filter), but your search analyzer isn't specified, so the standard analyzer is used to search the tags field. It strips out the hyphen and then queries with what is left (run curl "localhost:9200/_analyze?analyzer=standard" -d "deck-*" to see what I mean).
Basically, "deck-*" is being searched for as "deck *". No indexed token is just "deck", so it fails.
"deck-clo*" is being searched for as "deck clo*". Again, no indexed token is just "deck" or starts with "clo", so the query fails.
I'd make the following modifications
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "whitespace",
"filter" : ["lowercase"] <--- you don't need this, just thought it was a nice touch
}
}
}
Then get rid of the special analyzer on the tags:
"mappings" : {
"yacht1" : {
"properties" : {
"tags" : {
"type" : "string"
}
}
}
}
let me know how it goes.
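The explanation in this answer can be mimicked with a short Python sketch. It assumes the tags were indexed as whole tokens and compares two ways of treating the query: kept whole, or split at the hyphen as the answer says the standard analyzer does. fnmatchcase stands in for wildcard matching, and the function names are invented for the example.

```python
from fnmatch import fnmatchcase

# Tags as indexed by a whitespace tokenizer: each tag stays one token.
indexed_tags = ["deck-clothing-blue", "crew-clothing", "medium"]

def matches_any(pattern, tokens):
    """True when the wildcard pattern matches at least one whole token."""
    return any(fnmatchcase(token, pattern) for token in tokens)

def query_kept_whole(query):
    """Search side also leaves the hyphen alone."""
    return matches_any(query, indexed_tags)

def query_split_on_hyphen(query):
    """Search side breaks the query at '-', and every part must match."""
    parts = [part for part in query.split("-") if part]
    return all(matches_any(part, indexed_tags) for part in parts)

for query in ["deck*", "deck-*", "deck-clo*"]:
    print(query, query_split_on_hyphen(query), query_kept_whole(query))
```

With the split behaviour only deck* matches, which is the pattern reported in the question; once both sides treat the hyphen the same way, all three queries match.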

Resources