Get exact match after doing mapping as not_analyzed - elasticsearch

I have an Elasticsearch type that I mapped as below:
"mappings": {
"jardata": {
"properties": {
"groupID": {
"index": "not_analyzed",
"type": "string"
},
"artifactID": {
"index": "not_analyzed",
"type": "string"
},
"directory": {
"type": "string"
},
"jarFileName": {
"index": "not_analyzed",
"type": "string"
},
"version": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
I am indexing directory as analyzed since I want to be able to give only the last folder and still get results. But when I want to search for a specific directory I need to give the whole path, because the same folder name can exist under two different paths. The problem is that, since the field is analyzed, the search returns all matching documents instead of only the specific one I want.
So I want the field to behave as both analyzed and not_analyzed. Is there a way to do that?

Let's say you have the following document indexed:
{
"directory": "/home/docs/public"
}
The standard analyzer is not enough in your case, as it will create the following terms while indexing:
[home, docs, public]
Note that it misses the [/home/docs/public] token - characters like "/" act as separators here.
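You can verify this with the _analyze API (a quick check, not part of the mapping itself):
POST _analyze
{
  "analyzer": "standard",
  "text": "/home/docs/public"
}
which returns only the tokens home, docs and public.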
One solution could be to use an ngram tokenizer with the punctuation character class in the token_chars list, so that Elasticsearch treats "/" the same way as a letter or digit. This would allow searching with the following tokens:
[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
Index mapping:
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 18,
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
Now both search queries:
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/docs/private"
}
}
}
}
}
and
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/home/docs/private"
}
}
}
}
}
will return the indexed document in the results.
One thing you have to consider is the maximum token length, specified by the "max_gram" setting. For directory paths it may need to be longer.
An alternative solution is to use the whitespace tokenizer, which breaks the phrase into terms only on whitespace, combined with an ngram filter, using the following mapping:
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 4,
"max_gram": 20
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}

Update the mapping of the directory field to contain a raw sub-field like this:
"directory": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
}
}
And modify your query to use directory.raw, which will be treated as not_analyzed. See the Elasticsearch multi-fields documentation for details.
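For example, an exact-match query against the raw sub-field could look like this (a sketch; the path value is only an illustration):
{
  "query": {
    "term": {
      "directory.raw": "/home/docs/public"
    }
  }
}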

Wildcard / regexp in a phrase which has space

Create an index. Here I am using edge_ngram:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"city": {
"type": "keyword",
"fields": {
"raw": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
POST my_index/my_type/1
{
"text": "2 #Quick Foxes lived and died"
}
POST my_index/my_type/2
{
"text": "2 #Quick Foxes lived died"
}
Now when we search
GET my_index/my_type/_search
{
"query": {
"query_string": {
"default_operator" : "AND",
"query" : "f* d*",
"fields": ["text.raw"]
}
}
}
Only ID 2 should be listed, but nothing is returned.
When you try this:
GET my_index/my_type/_search
{
"query": {
"query_string": {
"default_operator" : "AND",
"query" : "f* d*",
"fields": ["text"]
}
}
}
It will return both documents.
If we have an index with a huge amount of data and we want to search with wildcards, how do we do it?
A single keyword will work, but if we add phrases like the one mentioned in the example, it won't give any proper result.
To build a regex expression you can use these websites: generate the expression at http://buildregex.com/ and test your string against it at https://regex101.com/

Elasticsearch: index first char of string

I'm using version 5.3.
I have a text field a. I'd like to aggregate on the first char of a. I also need the entire original value.
I'm assuming the most efficient way is to have a keyword field a.firstLetter with a custom normalizer. I've tried to achieve this with a pattern replace char filter but am struggling with the regexp.
Am I going about this entirely wrong? Can you help me?
EDIT
This is what I've tried.
settings.json
{
"settings": {
"index": {
"analysis": {
"char_filter": {
"first_char": {
"type": "pattern_replace",
"pattern": "(?<=^.)(.*)",
"replacement": ""
}
},
"normalizer": {
"first_letter": {
"type": "custom",
"char_filter": ["first_char"]
"filter": ["lowercase"]
}
}
}
}
}
}
mappings.json
{
"properties": {
"a": {
"type": "text",
"index_options": "positions",
"fields": {
"firstLetter": {
"type": "keyword",
"normalizer": "first_letter"
}
}
}
}
}
I get no buckets when I try to aggregate like so:
"aggregations": {
"grouping": {
"terms": {
"field": "a.firstLetter"
}
}
}
So basically my approach was "replace all but the first char with an empty string." The regexp is something I was able to gather by googling.
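As an aside, the pattern itself can be tested in isolation through the _analyze API with an inline char filter definition (a diagnostic sketch, not part of the original setup): the lookbehind (?<=^.) anchors just after the first character and (.*) captures everything that follows, which is then replaced with an empty string.
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(?<=^.)(.*)",
      "replacement": ""
    }
  ],
  "text": "Amsterdam"
}
This should emit the single token "A".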
EDIT 2
I had misconfigured the normalizer (I've fixed the examples). The correct configuration reveals that normalizers do not support pattern replace char filters due to issue 23142. Apparently support for it will be implemented earliest in version 5.4.
So are there any other options? I'd hate to do this in code, by adding a field in the doc for the first letter, since I'm using Elasticsearch features for every other aggregation.
You can use the truncate filter with a length of one:
PUT foo
{
"mappings": {
"bar" : {
"properties": {
"name" : {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "keyword",
"filter" : [ "my_filter", "lowercase" ]
}
},
"filter": {
"my_filter": {
"type": "truncate",
"length": 1
}
}
}
}
}
}
GET foo/_analyze
{
"field" : "name",
"text" : "New York"
}
# response
{
"tokens": [
{
"token": "n",
"start_offset": 0,
"end_offset": 8,
"type": "word",
"position": 0
}
]
}
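To get the first-letter buckets the original question asks for, a terms aggregation could then be run on that field. This is a minimal sketch and assumes you also set "fielddata": true on the name field (not shown in the mapping above), since terms aggregations on analyzed text fields require fielddata:
GET foo/_search
{
  "size": 0,
  "aggs": {
    "by_first_letter": {
      "terms": {
        "field": "name"
      }
    }
  }
}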

ElasticSearch Reverse Wildcard Search

In Elasticsearch v5.2.2 I can search for "Jo*" using a wildcard query and it will match the indexed value "Joseph".
But what if my index also has the values "Joseph", "Jo", "Jos", "Jose" and "Josep", and I want to reverse the query?
How can I find "Jo", "Jos", "Jose" and "Josep" in the index using the string "Joseph" as search criteria?
That's possible, but you need to create an edgeNGram search analyzer in your index settings.
First create the settings like this. The name field will be indexed with the standard analyzer but searched with your custom prefix_search analyzer instead.
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"prefix_search": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"prefix"
]
}
},
"filter": {
"prefix": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 10
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "string",
"analyzer": "standard",
"search_analyzer": "prefix_search"
}
}
}
}
}
Then if you create a document like this:
PUT test/doc/1
{
"name": "Jos"
}
You can find it with a query like this one:
POST /test/doc/_search
{
"query": {
"match": {
"name": "Joseph"
}
}
}
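This works because the match query analyzes "Joseph" with the prefix_search analyzer, producing every prefix of the term, any one of which can match the indexed value jos. You can verify the query-side tokens with the _analyze API:
GET test/_analyze
{
  "analyzer": "prefix_search",
  "text": "Joseph"
}
which should return the tokens j, jo, jos, jose, josep and joseph.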

wildcard on different tokens in elastic search

I have a document which looks like this
Name
Thomy tyson Olando Magua
Using ngram I was able to achieve wildcard search, so that if I type in "omy tyson" it returns the above document, pretty much like this SQL query:
select name from table where name like '%omy tyson%'
PUT sample
{
"settings": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "15"
}
}
}
},
"mappings": {
"typename": {
"properties": {
"name": {
"type": "string",
"fields": {
"search": {
"type": "string",
"analyzer": "my_ngram_analyzer"
}
}
}
}
}
}
}
PUT sample/typename/2
{
"name": "Thomy tyson Olando Magua"
}
{
"query": {
"bool": {
"should": [
{
"term": {
"name.search": "omy tyson"
}
}
]
}
}
}
Is there a way in Elasticsearch where I can perform a wildcard search on two different words separated by other words, like:
select name from table where name like '%omy Magua%'
So in this case I would like to perform a partial search on the first and fourth words.
Any feedback would be helpful.

"Letter" tokenizer and "word_delimiter" filter not working with underscores

I built an Elasticsearch index using a custom analyzer which uses the letter tokenizer and the lowercase and word_delimiter token filters. Then I tried searching for documents containing underscore-separated sub-words, e.g. abc_xyz, using only one of the sub-words, e.g. abc, but it didn't come back with any results. When I tried the full word, i.e. abc_xyz, it did find the document.
Then I changed the document to have dash-separated sub-words instead, e.g. abc-xyz and tried to search by sub-words again and it worked.
To try to understand what is going on, I checked the terms generated for my documents using the _termvector service. The result was identical for both the underscore-separated and the dash-separated sub-words, so I would expect the search results to be identical in both cases.
Any idea what I could be doing wrong?
If it helps, these are the settings I used for my index:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"cmt_value_analyzer": {
"tokenizer": "letter",
"filter": [
"lowercase",
"my_filter"
],
"type": "custom"
}
},
"filter": {
"my_filter": {
"type": "word_delimiter"
}
}
}
}
},
"mappings": {
"alertmodel": {
"properties": {
"name": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"productId": {
"type": "double"
},
"productName": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"link": {
"analyzer": "cmt_value_analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "string"
},
"updatedOn": {
"type": "date"
}
}
}
}
}
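One way to narrow this down (a diagnostic sketch; the index name my_index is an assumption, substitute your own) is to run both variants through the _analyze API with the custom analyzer and compare the emitted tokens against the _termvector output and the terms your queries produce:
GET my_index/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc_xyz"
}
GET my_index/_analyze
{
  "analyzer": "cmt_value_analyzer",
  "text": "abc-xyz"
}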
