Using Elasticsearch to search special characters

How can I force Elasticsearch's query_string query to recognize '#' as a simple character?
Assume I have an index to which I added a few documents with this bulk request:
POST test/item/_bulk
{"index":{}}
{"text": "john.doe#gmail.com"}
{"index":{}}
{"text": "john.doe#outlook.com"}
{"index":{}}
{"text": "john.doe#gmail.com, john.doe#outlook.com"}
{"index":{}}
{"text": "john.doe[at]gmail.com"}
{"index":{}}
{"text": "john.doe gmail.com"}
I want this search:
GET test/item/_search
{
"query":
{
"query_string":
{
"query": "*#gmail.com",
"analyze_wildcard": "true",
"allow_leading_wildcard": "true",
"default_operator": "AND"
}
}
}
to return only the first and third documents.
I tried 3 kinds of mappings.
First I tried:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"email_analyzer": {
"tokenizer": "email_tokenizer"
}
},
"tokenizer": {
"email_tokenizer": {
"type": "uax_url_email"
}
}
}
},
"mappings": {
"item": {
"properties": {
"text": {
"type": "string",
"analyzer": "email_analyzer"
}
}
}
}
}
Then I tried:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "whitespace"
}
}
}
},
"mappings": {
"item": {
"properties": {
"text": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}
And I also tried this one:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "whitespace"
}
}
}
},
"mappings": {
"item": {
"properties": {
"text": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
None of the above worked; in fact, they all returned all of the documents.
Is there an analyzer/tokenizer/parameter that will make Elasticsearch treat the '#' sign like any other character?

This works with your last setting, the one where the text field is set to not_analyzed:
GET test/item/_search
{
"query":
{
"wildcard":
{
"text": "*#gmail.com*"
}
}
}
When using a not_analyzed field, you should use a term-level query rather than a full-text query: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/term-level-queries.html
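To see why the original query_string attempts matched every document (my own check, not part of the answer above): with the default standard analyzer the '#' never makes it into the index, so "*#gmail.com" effectively degenerates to matching "gmail.com". The _analyze API makes this visible:
GET _analyze
{
"analyzer": "standard",
"text": "john.doe#gmail.com"
}
This returns the tokens john.doe and gmail.com; the '#' is simply dropped at the token boundary.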

Related

Elasticsearch exclude "stop" words from highlight

I want to exclude the default stop words from being highlighted, but I'm not sure why this isn't working.
ES config:
"settings": {
"analysis": {
"analyzer": {
"search_synonyms": {
"tokenizer": "whitespace",
"filter": [
"graph_synonyms",
"lowercase",
"asciifolding",
"stop"
]
}
},
"filter": {
"graph_synonyms": {
...
}
},
"normalizer": {
"normalizer_1": {
...
}
}
}
},
Fields mapping:
"mappings": {
"properties": {
"description": {
"type": "text",
"analyzer": "search_synonyms"
},
"narrative": {
"type":"object",
"properties":{
"_all":{
"type": "text",
"analyzer": "search_synonyms"
}
}
},
"originator": {
"type": "keyword",
"normalizer": "normalizer_1"
},
................
}
}
Highlight query:
"highlight": {
"fields": {
"*": {}
}
},
Currently stop words such as "this", "A", "IS" are getting highlighted within the narrative fields, and I want to prevent that.
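A first debugging step (my suggestion, not from the original post; the index name my_index is hypothetical) would be to check what search_synonyms actually emits, since only terms that survive analysis can be matched and highlighted:
GET my_index/_analyze
{
"analyzer": "search_synonyms",
"text": "this is a narrative"
}
If tokens like "this" or "is" appear in the output, the stop filter is not taking effect as expected, for example because the synonym graph re-injects them, and that would explain the highlights.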

Count n-grams with token_count field

Is it possible to count number of produced n-grams using token_count field?
Let's suppose I have the following mapping:
{
"settings": {
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "trigrams_filter"]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"message": {
"type": "text",
"analyzer": "trigrams",
"fields": {
"length": {
"type": "token_count",
"analyzer": "trigrams"
}
}
}
}
}
}
}
With this mapping I'd expect to get three terms for the value "quick": "qui", "uic" and "ick". But the following query doesn't return any hit, despite the fact that the message.length field uses the trigrams analyzer:
{
"query": {
"term": {
"message.length": 3
}
}
}
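One thing worth checking (my own debugging sketch, not from the question; the index name test_ngrams is hypothetical) is at which positions the trigrams analyzer emits its tokens, because the ngram token filter places all grams of a word at the same position, and token_count counts position increments by default:
GET test_ngrams/_analyze
{
"analyzer": "trigrams",
"text": "quick"
}
If "qui", "uic" and "ick" all report the same position, the counted value for "quick" is 1 rather than 3, which would explain why the term query for 3 finds nothing.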

Matching closest ancestor with Path Hierarchy Tokenizer

I've got an Elasticsearch v5 index set up for mapping config hashes to URLs.
{
"settings": {
"analysis": {
"analyzer": {
"url-analyzer": {
"type": "custom",
"tokenizer": "url-tokenizer"
}
},
"tokenizer": {
"url-tokenizer": {
"type": "path_hierarchy",
"delimiter": "/"
}
}
}
},
"mappings": {
"route": {
"properties": {
"uri": {
"type": "string",
"index": "analyzed",
"analyzer": "url-analyzer"
},
"config": {
"type": "object"
}
}
}
}
}
I would like to match the longest path prefix with the highest score, so that given the documents
{ "uri": "/trousers/", "config": { "foo": 1 }}
{ "uri": "/trousers/grey", "config": { "foo": 2 }}
{ "uri": "/trousers/grey/lengthy", "config": { "foo": 3 }}
when I search for /trousers, the top result should be /trousers, and when I search for /trousers/grey/short the top result should be /trousers/grey.
Instead, I'm finding that the top result for /trousers is /trousers/grey/lengthy.
How can I index and query my documents to achieve this?
I have one solution, after drinking on it: what if we treat the URI in the index as a keyword, but still use the PathHierarchyTokenizer on the search input?
Now we store the following docs:
/trousers
/trousers/grey
/trousers/grey/lengthy
When we submit a query for /trousers/grey/short, the search_analyzer can build the input [/trousers, /trousers/grey, /trousers/grey/short].
The first two of our documents will match, and we can trivially select the longest match using a custom sort.
Now our mapping document looks like this:
{
"settings": {
"analysis": {
"analyzer": {
"uri-analyzer": {
"type": "custom",
"tokenizer": "keyword"
},
"uri-query": {
"type": "custom",
"tokenizer": "uri-tokenizer"
}
},
"tokenizer": {
"uri-tokenizer": {
"type": "path_hierarchy",
"delimiter": "/"
}
}
}},
"mappings": {
"route": {
"properties": {
"uri": {
"type": "text",
"fielddata": true,
"analyzer": "uri-analyzer",
"search_analyzer": "uri-query"
},
"config": {
"type": "object"
}
}
}
}
}
and our query looks like this:
{
"sort": {
"_script": {
"script": "doc.uri.length",
"order": "asc",
"type": "number"
}
},
"query": {
"match": {
"uri": {
"query": "/trousers/grey/lengthy",
"type": "boolean"
}
}
}
}
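To verify the token stream this relies on (my own check; the index name routes is hypothetical, since the post doesn't show one), the _analyze API lists the prefixes the uri-query analyzer produces, leading slashes included:
GET routes/_analyze
{
"analyzer": "uri-query",
"text": "/trousers/grey/short"
}
This returns /trousers, /trousers/grey and /trousers/grey/short; the first two equal stored URIs, and the script sort then puts the longest of those matches on top.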

Google type query using Elasticsearch

Suppose I have the following document:
{"title": "Sennheiser HD 800"}
I want all of these queries to return this document:
senn
heise
sennheise
sennheiser
sennheiser 800
sennheiser hd
hd
800 hd
hd ennheise
In short, I want to find partial words, either one or more.
In my mapping I am using this analyzer:
{
"settings": {
"analysis": {
"analyzer": {
"case_insensitive_sort": {
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
}
}
and the mapping:
{
"title": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "case_insensitive_sort"
}
}
}
}
and the query is a simple query_string query:
{
"query": {
"query_string": {
"fields": [
"title.lower_case_sort"
],
"query": "*800 hd*"
}
}
}
For example, this query fails.
You need ngrams.
Here is a blog post I wrote up about it for Qbox:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
(Note that "index_analyzer" no longer works in ES 2.x; use "analyzer" instead; "search_analyzer" still works, though.)
Using this mapping (slightly modified from one in the blog post; I'll refer you there for an in-depth explanation):
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"analyzer": "ngram_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
Index your document:
POST /test_index/doc/1
{
"title": "Sennheiser HD 800"
}
and then any of your listed queries work, in the following form:
POST /test_index/_search
{
"query": {
"match": {
"title": {
"query": "heise hd 800",
"operator": "and"
}
}
}
}
If you only have a single term, then you don't need the "operator" part:
POST /test_index/_search
{
"query": {
"match": {
"title": "hd"
}
}
}
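To see why a fragment like "heise" matches (my own check against the mapping above), you can inspect the grams that ngram_analyzer produces:
POST /test_index/_analyze
{
"analyzer": "ngram_analyzer",
"text": "Sennheiser"
}
The output contains every 2- to 20-character gram of "sennheiser", including "heise"; since the search_analyzer is standard, the query term stays intact at search time and matches that indexed gram.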
Here is some code I used to play around with it:
http://sense.qbox.io/gist/a9accf67f1713ca99819f45ce0ac28adaea691a9

In ES, how do I write mappings so that wildcard queries work for both lowercase and uppercase?

Hello all, I am facing two problems in ES.
I have a city "New York" in ES, and I want to write a term filter such that it returns a document only when the given string exactly matches "New York". What is happening is that my filter matches "New" or "York" and returns "New York" for both, but it returns nothing for "New York" itself. My mapping is given below; please tell me which analyzer or tokenizer I should use inside the mapping.
Here are the settings and mapping:
Here are the settings and mapping:
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": ["synonym"]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
},
"mappings": {
"restaurant": {
"properties": {
"address": {
"properties": {
"city": {"type": "string", "analyzer": "synonym"}
}
}
}
}
}
The second problem concerns wildcard queries. When I search with a lowercase term, e.g. "new*", ES returns nothing, but with an uppercase term, e.g. "New*", it returns "New York". I want to write my city mapping such that searching with lowercase or uppercase both return the same thing. I have seen ignore_case and set it inside the synonym filter, but I am still not able to search with both lowercase and uppercase:
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt",
"ignore_case": true // See here
}
I believe you didn't provide enough details but, hoping that my attempt will generate follow-up questions from you, I will post what I believe should be a step forward.
The mapping:
PUT test
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
},
"keyword_lowercase": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt",
"ignore_case": true
}
}
}
}
},
"mappings": {
"restaurant": {
"properties": {
"address": {
"properties": {
"city": {
"type": "string",
"analyzer": "synonym",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"raw_ignore_case": {
"type": "string",
"analyzer": "keyword_lowercase"
}
}
}
}
}
}
}
}
}
Test data:
POST /test/restaurant/1
{
"address": {"city":"New York"}
}
POST /test/restaurant/2
{
"address": {"city":"new york"}
}
Query for the first problem:
GET /test/restaurant/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"address.city.raw": "New York"
}
}
}
}
}
Query for the second problem:
GET /test/restaurant/_search
{
"query": {
"query_string": {
"query": "address.city.raw_ignore_case:new*"
}
}
}
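To confirm the case handling (my own check against the mapping above), the keyword_lowercase analyzer collapses both spellings to the same single token:
GET /test/_analyze
{
"analyzer": "keyword_lowercase",
"text": "New York"
}
This returns the single token "new york" for either capitalization, and because query_string lowercases wildcard terms by default in this version (lowercase_expanded_terms), both new* and New* match against address.city.raw_ignore_case.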
