Elasticsearch - no hit though there should be a result

I've encountered the following problem with Elasticsearch; does anyone know where I should troubleshoot?
I'm happily retrieving results with the following query:
{
  "query" : {
    "match" : { "name" : "A1212001" }
  }
}
But when I shorten the value of the search field "name" to a substring, I get no hit:
{
  "query" : {
    "match" : { "name" : "A12120" }
  }
}
"A12120" is a substring of already hit query "A1212001"

If you don't have too many documents, you can go with a regexp query
POST /index/_search
{
  "query" : {
    "regexp" : {
      "name" : "A12120.*"
    }
  }
}
or even a wildcard one
POST /index/_search
{
  "query" : {
    "wildcard" : { "name" : "A12120*" }
  }
}
However, as @Waldemar suggested, if you have many documents in your index, the best approach for this is to use an EdgeNGram tokenizer, since the above queries are not very performant.
First, you define your index settings like this:
PUT index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "edge_tokens",
          "filter" : [ "lowercase" ]
        }
      },
      "tokenizer" : {
        "edge_tokens" : {
          "type" : "edgeNGram",
          "min_gram" : "1",
          "max_gram" : "10",
          "token_chars" : [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings" : {
    "my_type" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "analyzer" : "my_analyzer",
          "search_analyzer" : "standard"
        }
      }
    }
  }
}
Then, when indexing a document whose name field contains A1212001, the following tokens will be indexed: A, A1, A12, A121, A1212, A12120, A121200, A1212001. So when you search for A12120, you'll find a match.
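To double-check the tokens your analyzer emits, you can run it through the _analyze API (the request-body form below works on ES 5.x and later; on older versions, pass analyzer and text as query-string parameters):
GET index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "A1212001"
}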

You are using a match query; this query looks for terms inside Lucene, and your indexed term is A1212001. If you need to find part of a term, you can use a regexp query, but be aware that regex has an internal cost: the shard has to check all of its terms.
If you need a more robust way to search for part of a term, you can use nGrams.
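A minimal sketch of that nGram variant, in case it helps; all names here are illustrative, and note that an ngram filter (unlike the EdgeNGram tokenizer above) also matches in the middle of a term:
PUT index
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "my_ngram_filter" : {
          "type" : "ngram",
          "min_gram" : 3,
          "max_gram" : 10
        }
      },
      "analyzer" : {
        "my_ngram_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "my_ngram_filter" ]
        }
      }
    }
  }
}
On recent versions you may also need to raise index.max_ngram_diff to allow a 3-10 gram range.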

Related

Query to partially match every word in a search term in Elasticsearch

I have an array of tags containing words.
tags: ['australianbrownsnake', 'venomoussnake', ...]
How do I match this against these search terms:
'brown snake', 'australian snake', 'venomous', 'venomous brown snake'
I am not even sure if this is possible since I am new to Elasticsearch.
Help would be appreciated. Thank you.
Edit: I have created an nGram analyzer and added a field called ngram, like so:
"properties" : {
  "tags" : {
    "type" : "text",
    "fields" : {
      "ngram" : {
        "type" : "text",
        "analyzer" : "my_analyzer"
      }
    }
  }
}
I tried the following query, but no luck:
"query": {
"multi_match": {
"query": "snake",
"fields": [
"tags.ngram"
],
"type": "most_fields"
}
}
My tag mapping is as follows:
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
},
"ngram" : {
"type" : "text",
"analyzer" : "my_analyzer"
}
}
},
My settings are:
{
  "image" : {
    "settings" : {
      "index" : {
        "max_ngram_diff" : "10",
        "number_of_shards" : "1",
        "provided_name" : "image",
        "creation_date" : "1572590562106",
        "analysis" : {
          "analyzer" : {
            "my_analyzer" : {
              "tokenizer" : "my_tokenizer"
            }
          },
          "tokenizer" : {
            "my_tokenizer" : {
              "token_chars" : [
                "letter",
                "digit"
              ],
              "min_gram" : "3",
              "type" : "ngram",
              "max_gram" : "10"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "pO9F7W43QxuZmI9vmXfKyw",
        "version" : {
          "created" : "7040299"
        }
      }
    }
  }
}
Update: this config works fine. It was my mistake: I was searching on the wrong index.
You need to index your tags in the way you want to search them. For queries like 'brown snake' or 'australian snake' to match your tags, you need to break the tags into smaller tokens.
By default, Elasticsearch indexes strings by passing them through its standard analyzer. You can always create a custom analyzer to store your field however you want, for example one that tokenizes strings into nGrams. With a gram size of 3-10, your 'australianbrownsnake' tag will be stored as something like: ['aus', 'aust', ..., 'tra', 'tral', ...]
You can then modify your search query to match on your tags.ngram field, and you should get the desired results.
The tags.ngram field can be created like so:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
using the nGram tokenizer:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
EDIT1: Elasticsearch uses the analyzer of the field being matched on to analyze the query keywords. You might not need the user's query to be tokenized into nGrams, since a matching nGram should already be stored in the tags field. You could specify a standard search_analyzer in your mappings.
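For example, a sketch based on the mapping above (only the search_analyzer line is new):
"tags" : {
  "type" : "text",
  "fields" : {
    "ngram" : {
      "type" : "text",
      "analyzer" : "my_analyzer",
      "search_analyzer" : "standard"
    }
  }
}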

Match fails elasticsearch

I have the following index, in which I index e-mail addresses.
PUT _myindex
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : true,
          "patterns" : [
            "^(.*?)#",
            "(\\w+(?=.*#))"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "lowercase", "email", "unique" ]
        }
      }
    }
  },
  "mappings" : {
    "emails" : {
      "properties" : {
        "email" : {
          "type" : "text",
          "analyzer" : "email"
        }
      }
    }
  }
}
My e-mails have the following form: "example.elastic#yahoo.com". When I index them, they get analyzed into example.elastic#yahoo.com, example.elastic, elastic, and example.
When I run a match query:
GET _myindex/_search
{
  "query": {
    "match": {
      "email": "example.elastic#yahoo.com"
    }
  }
}
or use example, elastic, or Elastic as the query string, it works and retrieves results. But the problem is that when I have "example.elastic.blabla#yahoo.com" indexed, it also returns the same results. What can be the problem?
Using a term query instead of a match query will solve this.
The reason is that the match query applies the analyzer to the search term and will therefore match whatever is stored in the index, whereas the term query does not apply any analyzer and only looks for that exact term in the index.
Ref: https://stackoverflow.com/a/23151332/6546289
GET _myindex/_search
{
  "query": {
    "term": {
      "email": "example.elastic#yahoo.com"
    }
  }
}
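If you want to see exactly why the longer address also matched, you can analyze it with the same analyzer and look at the overlapping tokens (such as example and elastic):
GET _myindex/_analyze
{
  "analyzer" : "email",
  "text" : "example.elastic.blabla#yahoo.com"
}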

Undesired Stopwords in Elasticsearch

I am using Elasticsearch 6. This is the query:
PUT /semtesttest
{
  "settings": {
    "index" : {
      "analysis" : {
        "filter": {
          "my_stop": {
            "type": "stop",
            "stopwords_path": "analysis1/stopwords.csv"
          },
          "synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis1/synonym.txt"
          }
        },
        "analyzer" : {
          "my_analyzer" : {
            "tokenizer" : "standard",
            "filter" : ["synonym","my_stop"]
          }
        }
      }
    }
  },
  "mappings": {
    "all_questions": {
      "dynamic": "strict",
      "properties": {
        "kbaid": {
          "type": "integer"
        },
        "answer": {
          "type": "text"
        },
        "question": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT /semtesttest/all_questions/1
{
  "question": "this is hippie"
}
GET /semtesttest/all_questions/_search
{
  "query": {
    "fuzzy": { "question": { "value": "hippie", "fuzziness": 2 } }
  }
}
GET /semtesttest/all_questions/_search
{
  "query": {
    "fuzzy": { "question": { "value": "this is", "fuzziness": 2 } }
  }
}
In synonym.txt:
this, that, money => sainai
In stopwords.csv:
hello
how
are
you
The first GET ('hippie') returns nothing; only the second GET ('this is') returns results.
What is the problem? It looks like "this is" is being filtered out as stop words, even though I have specified my stop words explicitly.
fuzzy is a term-level query. It is not going to analyze the input, so your query was looking for the exact term this is (with some fuzziness applied).
So you either want to build a query off those two terms separately, or use a full-text query instead. If fuzziness is important, I think the only full-text query that supports it is match:
GET /semtesttest/all_questions/_search?pretty
{
  "query": {
    "match": { "question": { "query": "this is", "fuzziness": 2 } }
  }
}
If matching phrases is important, you may want to look at this answer and work with span queries.
This might also help you see how your analyzer is being used:
GET /semtesttest/_analyze?analyzer=my_analyzer&field=question&text=this is
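On newer versions, where the query-string form of _analyze has been removed, the equivalent request-body form would be:
GET /semtesttest/_analyze
{
  "field" : "question",
  "text" : "this is"
}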

bidirectional match on elasticsearch

I've indexed a list of terms and now I want to query for some of them.
Say that I've indexed 'dog food', 'red dog', 'dog', 'food', 'cats'.
How do I create an exact bidirectional match query? I.e., when I search for 'dog' I want to get only the term 'dog' and not the other terms (because they don't match back).
One primitive solution I thought of is indexing each term together with its length (in words) and then restricting a search of length X to terms of length X, but that seems overcomplicated.
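As an aside, that word-count idea can be expressed natively with a token_count sub-field instead of indexing the length yourself; a rough sketch, names illustrative:
"name" : {
  "type" : "string",
  "fields" : {
    "length" : {
      "type" : "token_count",
      "analyzer" : "standard"
    }
  }
}
You could then combine a match on name with a term filter on name.length.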
Create a custom analyzer to lowercase and normalize your search terms. So that would be your index:
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer_keyword" : {
          "type" : "custom",
          "tokenizer" : "keyword",
          "filter" : [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings" : {
    "your_type" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "analyzer" : "my_analyzer_keyword"
        }
      }
    }
  }
}
So if you have indexed 'dog' and a user types in Dog, dog, or DOG, it will match only 'dog'; 'dog food' won't be brought back.
Just set your field's index property to not_analyzed, and your query should use a term filter to search for the text.
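A sketch of that suggestion, in the same pre-5.x syntax as the surrounding answers (index name illustrative; note that, unlike the lowercased keyword analyzer above, a not_analyzed field is case-sensitive):
PUT /test2
{
  "mappings" : {
    "your_type" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "index" : "not_analyzed"
        }
      }
    }
  }
}
GET /test2/your_type/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : { "name" : "dog" }
      }
    }
  }
}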
As per Evaldas' suggestion, find below a more complete solution that keeps the original value indexed with the standard analyzer but adds a sub-field with a lowercased version of the terms:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "asset": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "case_ignore": {
              "type": "string",
              "analyzer": "my_keyword_lowercase_analyzer"
            }
          }
        }
      }
    }
  }
}
POST /test/asset/1
{
  "name": "dog"
}
POST /test/asset/2
{
  "name": "dog food"
}
POST /test/asset/3
{
  "name": "red dog"
}
GET /test/asset/_search
{
  "query": {
    "match": {
      "name.case_ignore": "Dog"
    }
  }
}

Partial Search using Analyzer in ElasticSearch

I am using Elasticsearch to build an index of URLs.
I split each URL into 3 parts: "domain", "path", and "query".
For example, testing.com/index.html?user=who&pw=no is separated into:
domain = testing.com
path = index.html
query = user=who&pw=no
The problem is that I cannot partially search those fields in my index, e.g. for "user=who" or "ing.com".
Is it possible to use an analyzer at search time even though I didn't use one when indexing?
How can I do a partial search based on the analyzer?
Thank you very much.
2 approaches:
1. Wildcard search - easy and slow
"query": {
"query_string": {
"query": "*ing.com",
"default_field": "domain"
}
}
2. Use an nGram tokenizer - harder but faster
Index Settings
"settings" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "50"
}
}
}
}
Mapping
"properties": {
"domain": {
"type": "string",
"index_analyzer": "my_ngram_analyzer"
},
"path": {
"type": "string",
"index_analyzer": "my_ngram_analyzer"
},
"query": {
"type": "string",
"index_analyzer": "my_ngram_analyzer"
}
}
Querying
"query": {
"match": {
"domain": "ing.com"
}
}
The trick with the query string is to split a string like "user=who&pw=no" into the tokens ["user=who&pw=no", "user=who", "pw=no"] at index time. That lets you easily run queries like "user=who". You could do this with a pattern_capture token filter, but there may be better ways to do it as well.
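A rough sketch of that index-time setup, assuming the query part is kept as a single token by the keyword tokenizer first (pattern and names illustrative):
"analysis" : {
  "filter" : {
    "kv_pairs" : {
      "type" : "pattern_capture",
      "preserve_original" : true,
      "patterns" : [ "([^&]+)" ]
    }
  },
  "analyzer" : {
    "query_part_analyzer" : {
      "tokenizer" : "keyword",
      "filter" : [ "lowercase", "kv_pairs" ]
    }
  }
}
With this, "user=who&pw=no" is indexed as ["user=who&pw=no", "user=who", "pw=no"].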
You can also make the hostname and path more searchable with the path_hierarchy tokenizer; for example, "/some/path/somewhere" becomes ["/some/path/somewhere", "/some/path", "/some"]. You can index the hostname with the path_hierarchy tokenizer as well, using the settings reverse: true and delimiter: ".". You may also want to use a stop words filter to exclude top-level domains.
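For the hostname, that reversed path_hierarchy setup could look roughly like this (tokenizer name illustrative); it would index "www.testing.com" as ["www.testing.com", "testing.com", "com"]:
"tokenizer" : {
  "domain_tokenizer" : {
    "type" : "path_hierarchy",
    "delimiter" : ".",
    "reverse" : true
  }
}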
