Emails not being searched properly in elasticsearch - elasticsearch

I have indexed a few documents in Elasticsearch which have email ids as a field. But when I query for a specific email id, the search returns all the documents without filtering.
This is the query I have used:
{
  "query": {
    "match": {
      "mail-id": "abc@gmail.com"
    }
  }
}

By default, your mail-id field is analyzed by the standard analyzer, which will tokenize the email abc@gmail.com into the following two tokens:
{
"tokens" : [ {
"token" : "abc",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "gmail.com",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
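You can reproduce this output yourself with the _analyze API (same curl style as the rest of this answer; the standard analyzer is the one applied to your mail-id field by default):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'abc@gmail.com'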
What you need instead is to create a custom analyzer using the uax_url_email tokenizer, which will tokenize email addresses as a single token.
So you need to define your index as follows:
curl -XPUT localhost:9200/people -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email"
        }
      }
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "mail-id": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
After creating that index, you can see that the email abc@gmail.com is tokenized as a single token and your search will work as expected.
curl -XGET 'localhost:9200/people/_analyze?analyzer=my_analyzer&pretty' -d 'abc@gmail.com'
{
"tokens" : [ {
"token" : "abc#gmail.com",
"start_offset" : 0,
"end_offset" : 13,
"type" : "<EMAIL>",
"position" : 1
} ]
}
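For completeness, here is a quick end-to-end check against the people index defined above; the sample documents and ids are made up for illustration:

# index two sample documents (ids and the second address are hypothetical)
curl -XPUT 'localhost:9200/people/person/1' -d '{"mail-id": "abc@gmail.com"}'
curl -XPUT 'localhost:9200/people/person/2' -d '{"mail-id": "xyz@gmail.com"}'
curl -XPOST 'localhost:9200/people/_refresh'

# the original match query should now return only document 1
curl -XGET 'localhost:9200/people/_search?pretty' -d '{
  "query": {
    "match": {
      "mail-id": "abc@gmail.com"
    }
  }
}'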

This happens when you use the default mapping. Elasticsearch has the uax_url_email tokenizer, which identifies URLs and emails as a single entity/token.
You can read more about this in the Elasticsearch documentation for the uax_url_email tokenizer.

Related

Tokens at index time vs query time are not the same when using the common_grams filter in Elasticsearch

I want to use the common_grams token filter based on this link.
My Elasticsearch version is 7.17.8.
Here is the setting of my index in Elasticsearch.
I have defined a filter named "common_grams" that uses "common_grams" as its type.
I have defined a custom analyzer named "index_grams" that uses "whitespace" as its tokenizer and the above filter as a token filter.
I have just one field, named "title_fa", and I have used my custom analyzer for this field.
PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}
It works fine at index time and the tokens are what I expect them to be. Here I get the tokens via the Kibana dev tools.
GET /my-index-000007/_analyze
{
"analyzer": "index_grams",
"text" : "brown is the"
}
Here are the resulting tokens for the text:
{
"tokens" : [
{
"token" : "brown",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "brown_is",
"start_offset" : 0,
"end_offset" : 8,
"type" : "gram",
"position" : 0,
"positionLength" : 2
},
{
"token" : "is",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "is_the",
"start_offset" : 6,
"end_offset" : 12,
"type" : "gram",
"position" : 1,
"positionLength" : 2
},
{
"token" : "the",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 2
}
]
}
When I search the query "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details (screenshot: query-time tokens).
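If the screenshot is not available, one way to inspect how the query is actually parsed is the validate API with explain enabled (a sketch against the same index):

GET /my-index-000007/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "brown is the",
      "default_field": "title_fa"
    }
  }
}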
UPDATE:
I have added a sample document like this:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search "brown coat"
GET /my-index-000007/_search
{
"query": {
"query_string": {
"query": "brown is coat",
"default_field": "title_fa"
}
}
}
it returns the document because it searches:
["brown", "coat"]
When I search "brown is coat", it can't find the document because it is searching for
["brown is coat", "brown_is coat", "brown is_coat"]
Clearly, when it gets a query that contains a common word, it acts differently, and I guess it's because the index-time and query-time tokens are not the same.
Do you know where I am going wrong? Why is it acting differently?
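For reference, the common_grams documentation recommends pairing the index analyzer with a search analyzer whose common_grams filter has query_mode enabled, so that the index-time and query-time token graphs line up. A minimal sketch of that setup (the index name my-index-000008 and the names search_grams and common_grams_query are made up):

# with query_mode enabled, the search analyzer emits only the grams (e.g. brown_is, is_the)
PUT /my-index-000008
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        },
        "search_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams_query" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        },
        "common_grams_query": {
          "type": "common_grams",
          "common_words": [ "the", "is" ],
          "query_mode": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "search_analyzer": "search_grams"
      }
    }
  }
}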

How to get an index item that has "name": "McLaren" by searching with "mclaren" in Elasticsearch 1.7?

Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
Mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
Analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
I would like to get the index item by searching for mclaren, but the name indexed is McLaren.
I would like to stick to query_string because a lot of other functionality is based on it. Here is the query with which I can't get the expected result:
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "mclaren",
          "default_operator": "AND",
          "analyze_wildcard": true
        }
      }
    }
  },
  "size": 50,
  "from": 0,
  "sort": {}
}
How could I accomplish this? Thank you!
I got it! The problem is certainly around the word_delimiter token filter.
By default it:
Split tokens at letter case transitions. For example: PowerShot →
Power, Shot
(Cf. the documentation.)
So macLaren generates two tokens -> [mac, Laren], while maclaren generates only one token ['maclaren'].
Analyze example:
POST _analyze
{
"tokenizer": {
"pattern": """[^\p{L}\d]+""",
"type": "pattern"
},
"filter": [
"word_delimiter"
],
"text": ["macLaren", "maclaren"]
}
Response:
{
"tokens" : [
{
"token" : "mac",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "Laren",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "maclaren",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 102
}
]
}
So I think one option is to configure your word_delimiter filter with the option split_on_case_change set to false (see the parameters documentation).
PS: remember to remove the settings you previously added (cf. the comments), since with that setting your query_string query will only target a name field that does not exist.
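A minimal sketch of that option (the index name test-index and the filter name my_word_delimiter are made up; the Russian/Czech filters from the original analyzer are left out for brevity):

PUT /test-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "filename": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\d]+"
        }
      },
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "split_on_case_change": false
        }
      },
      "analyzer": {
        "filename_index": {
          "tokenizer": "filename",
          "filter": [
            "my_word_delimiter",
            "lowercase"
          ]
        }
      }
    }
  }
}

With split_on_case_change disabled, McLaren stays a single token and is lowercased to mclaren, so a query_string search for mclaren should match.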

CamelCase Search with Elasticsearch

I want to configure Elasticsearch so that searching for "JaFNam" will produce a good score for "JavaFileName".
I tried to build an analyzer that combines a CamelCase pattern analyzer with an edge_ngram tokenizer. I thought this would create terms like these:
J F N Ja Fi Na Jav Fil Nam Java File Name
But the tokenizer seems not to have any effect: I keep getting these terms:
Java File Name
What would the correct Elasticsearch configuration look like?
Example code:
curl -XPUT 'http://127.0.0.1:9010/hello?pretty=1' -d'
{
"settings":{
"analysis":{
"analyzer":{
"camel":{
"type":"pattern",
"pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
"filters": ["edge_ngram"]
}
}
}
}
}
'
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
"analyzer":"camel",
"text":"JavaFileName"
}'
results in:
{
"tokens" : [ {
"token" : "java",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
}, {
"token" : "file",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "name",
"start_offset" : 8,
"end_offset" : 12,
"type" : "word",
"position" : 2
} ]
}
Your analyzer definition is not correct: you need a tokenizer and an array of filters; as it is, your analyzer doesn't work. Try it like this instead:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "tokenizer": "my_pattern",
          "filter": [
            "my_gram"
          ]
        }
      },
      "filter": {
        "my_gram": {
          "type": "edge_ngram",
          "max_gram": 10
        }
      },
      "tokenizer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}
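Once the index is recreated with these settings, re-running the _analyze call from the question should show the edge n-grams (J, Ja, Jav, Java, F, Fi, and so on, given the filter's default min_gram of 1):

curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer": "camel",
  "text": "JavaFileName"
}'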

Elasticsearch custom analyzer not working

I am using Elasticsearch as my search engine. I am now trying to create a custom analyzer to make the field value just lowercase. The following is my code:
Create index and mapping
Create an index with a custom analyzer named test_lowercase:
curl -XPUT 'localhost:9200/test/' -d '{
"settings": {
"analysis": {
"analyzer": {
"test_lowercase": {
"type": "pattern",
"pattern": "^.*$"
}
}
}
}
}'
create a mapping using the test_lowercase analyzer for the address field:
curl -XPUT 'localhost:9200/test/_mapping/Users' -d '{
"Users": {
"properties": {
"name": {
"type": "string"
},
"address": {
"type": "string",
"analyzer": "test_lowercase"
}
}
}
}'
To verify that the test_lowercase analyzer works:
curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
As we can see, the string 'Beijing China' is indexed as a single lowercased term 'beijing china', so the test_lowercase analyzer works fine.
To verify that the field 'address' is using the lowercase analyzer:
curl -XGET 'http://localhost:9200/test/_analyze?field=address&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "\nbeijing china\n",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
} ]
}
curl -XGET 'http://localhost:9200/test/_analyze?field=name&pretty' -d '
Beijing China
'
{
"tokens" : [ {
"token" : "beijing",
"start_offset" : 1,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "china",
"start_offset" : 9,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
As we can see, for the same string 'Beijing China', analyzing with field=address creates a single term 'beijing china', while field=name gives two terms 'beijing' and 'china', so it seems the address field is indeed using my custom analyzer test_lowercase.
Insert a document into the test index to see if the analyzer works for documents:
curl -XPUT 'localhost:9200/test/Users/12345?pretty' -d '{"name": "Jinshui Tang", "address": "Beijing China"}'
Unfortunately, the document has been successfully inserted but the address field has not been analyzed as expected. I can't find it with the following wildcard query:
curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
"query": {
"wildcard": {
"address": "*beijing ch*"
}
}
}'
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
List all terms analyzed for the document:
So I ran the following command to see all terms of the document, and I found that 'Beijing China' is not in the term vectors at all.
curl -XGET 'http://localhost:9200/test/Users/12345/_termvector?fields=*&pretty'
{
"_index" : "test",
"_type" : "Users",
"_id" : "12345",
"_version" : 3,
"found" : true,
"took" : 2,
"term_vectors" : {
"name" : {
"field_statistics" : {
"sum_doc_freq" : 2,
"doc_count" : 1,
"sum_ttf" : 2
},
"terms" : {
"jinshui" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 0,
"start_offset" : 0,
"end_offset" : 7
} ]
},
"tang" : {
"term_freq" : 1,
"tokens" : [ {
"position" : 1,
"start_offset" : 8,
"end_offset" : 12
} ]
}
}
}
}
}
We can see that the name was correctly analyzed into the two terms 'jinshui' and 'tang', but the address is lost.
Can anyone please help? Is there anything missing?
Thanks a lot!
To lowercase the text you don't need a pattern. Use something like this:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_lowercase": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  }
}

PUT /test/_mapping/Users
{
  "Users": {
    "properties": {
      "name": {
        "type": "string"
      },
      "address": {
        "type": "string",
        "analyzer": "test_lowercase"
      }
    }
  }
}

PUT /test/Users/12345
{"name": "Jinshui Tang", "address": "Beijing China"}
And to verify you did the right thing, use this:
GET /test/Users/_search
{
"fielddata_fields": ["name", "address"]
}
And you will see exactly how Elasticsearch is indexing your data:
"fields": {
"name": [
"jinshui",
"tang"
],
"address": [
"beijing",
"china"
]
}
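With address indexed as the single lowercased term 'beijing china', the wildcard query from the question should now match the document:

GET /test/Users/_search
{
  "query": {
    "wildcard": {
      "address": "*beijing ch*"
    }
  }
}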

Elasticsearch, search for domains in urls

We index HTML documents which may include links to other documents. We're using Elasticsearch and things are pretty smooth for most keyword searches, which is great.
Now we're adding more complex searches similar to Google site: or link: searches: basically we want to retrieve documents which point to either specific URLs or even whole domains. (If document A has a link to http://a.site.tld/path/, the search link:http://a.site.tld should yield it.)
And we're now trying to figure out what would be the best way to achieve this.
So far, we have extracted the links from the documents and added a links field to our documents. We set up the links to be not analyzed. We can then do searches that match the exact URL link:http://a.site.tld/path/, but of course link:http://a.site.tld does not yield anything.
Our initial idea would be to create a new field linkedDomains which would work similarly... but maybe there are better solutions?
You could try the Path Hierarchy Tokenizer:
Define a mapping as follows:
PUT /link-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "link": {
          "type": "string",
          "index_analyzer": "path-analyzer"
        }
      }
    }
  }
}
Index a doc:
POST /link-demo/doc
{
  "link": "http://a.site.tld/path/"
}
The following term query returns the indexed doc:
POST /link-demo/_search?pretty
{
"query": {
"term": {
"link": {
"value": "http://a.site.tld"
}
}
}
}
To get a feel for how this is being indexed:
GET link-demo/_analyze?analyzer=path-analyzer&text=http://a.site.tld/path&pretty
Shows the following:
{
  "tokens" : [ {
    "token" : "http:",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "http:/",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "http://a.site.tld",
    "start_offset" : 0,
    "end_offset" : 17,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "http://a.site.tld/path",
    "start_offset" : 0,
    "end_offset" : 22,
    "type" : "word",
    "position" : 1
  } ]
}
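If you also need the exact full-URL matches mentioned in the question, a multi-field can keep both views of the same link. A sketch using the same ES 1.x-era mapping conventions (the index name link-demo-2 and the raw sub-field are made up):

PUT /link-demo-2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "link": {
          "type": "string",
          "index_analyzer": "path-analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

A term query on link then matches by URL prefix (for example the bare domain), while a term query on link.raw matches only the exact URL.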
