Substring and similarity matching in elasticsearch

I am learning to use Elasticsearch as an alternative for database queries, and I am not able to perform substring matches on the index I have built.
The mapping I have used to create the index is
"mappings" : {
"user" : {
"properties" : {
"name" : {"type": "string"},
"specialty" : {"type": "string" ,"analyzer":"snowball"},
"address : {"type": "string" ,"analyzer":"snowball"}
}
}
}
The document I am indexing is
{
    "name" : "John Doe",
    "speciality": ["pediatrician", "Child Doctor"],
    "address": ["#123 park road Abbeyville", "#423 park road AbbeyTown"]
}
When I perform a search like
curl -XGET localhost:9200/test/user/_search?q=speciality:pediatrician
I get the correct document
However, when I search for strings like
curl -XGET localhost:9200/test/user/_search?q=speciality:pedia
curl -XGET localhost:9200/test/user/_search?q=speciality:pediatricians
No results are retrieved
P.S. I know that wildcards can be used for matching, but I need to be able to search for both whole words and parts of words based on user input, so as to return the most relevant documents.

Did you try reindexing after changing the mapping? Also try setting the search analyzer to snowball in the settings.
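For reference, a rough sketch of what that recreated index could look like, with snowball applied at both index and search time (the field names and the legacy string type are taken from the question; note the field is spelled speciality here so it matches the documents and queries above):

curl -XPUT 'localhost:9200/test' -d '{
    "mappings" : {
        "user" : {
            "properties" : {
                "name" : {"type": "string"},
                "speciality" : {"type": "string", "analyzer": "snowball", "search_analyzer": "snowball"},
                "address" : {"type": "string", "analyzer": "snowball", "search_analyzer": "snowball"}
            }
        }
    }
}'

With snowball applied on both sides, "pediatricians" and "pediatrician" should stem to the same token, so the plural query should match; the partial term "pedia" still won't, which is what the wildcard suggestion below addresses.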
UPDATE:
You can also go for a wildcard search. Prefer a trailing wildcard alone over both a leading and a trailing wildcard, since leading wildcards are expensive to evaluate:
curl -XGET localhost:9200/test/user/_search?q=speciality:pedia*
curl -XGET localhost:9200/test/user/_search?q=speciality:pediatricians*

Related

Elasticsearch fuzziness with multi_match and bool_prefix type

I have a set of search_as_you_type fields I need to search against. Here is my mapping:
"mappings" : {
"properties" : {
"description" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
"questions" : {
"properties" : {
"content" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
},
"title" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
}
}
I am using a multi_match query with bool_prefix type.
"query": {
"multi_match": {
"query": "triangle",
"type": "bool_prefix",
"fields": [
"title",
"title._2gram",
"title._3gram",
"description",
"description._2gram",
"description._3gram",
"questions.content",
"questions.content._2gram",
"questions.content._3gram",
"questions.tags",
"questions.tags._2gram",
"questions.tags._3gram"
]
}
}
So far this works fine. Now I want to add typo tolerance, i.e. fuzziness in ES. However, it looks like bool_prefix has some conflicts with this: if I modify my query to add "fuzziness": "AUTO" and make an error in a word ("triangle" -> "triangld"), I don't get any results.
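For clarity, the modified query from that attempt looks roughly like this (identical to the query above, with only the fuzziness option added and the misspelled query text):

"query": {
    "multi_match": {
        "query": "triangld",
        "type": "bool_prefix",
        "fuzziness": "AUTO",
        "fields": [
            "title",
            "title._2gram",
            "title._3gram",
            "description",
            "description._2gram",
            "description._3gram",
            "questions.content",
            "questions.content._2gram",
            "questions.content._3gram",
            "questions.tags",
            "questions.tags._2gram",
            "questions.tags._3gram"
        ]
    }
}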
However, if I am looking for a phrase like "right triangle", I see different behavior:
even if no typo is made, I get more results with just "fuzziness": "AUTO" added (1759 vs 1267)
if I add a typo to the 2nd word ("right triangdd"), it seems to work, but it now pushes results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy", etc.) to the front
if I make a typo in the 1st word ("righd triangle") or in both words ("righd triangdd"), the results seem to be just fine, so this is probably the only correct behavior
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround for this. I've tried changing the query type, but it looks like bool_prefix is the only one that supports search-as-you-type, and I need to return results as the user starts typing.
Since I make all the requests to ES from our backend, what I can also do is manipulate the query string to build different search query types if needed: for example, use one type for single-word searches and another for multi-word searches. But I basically need to maintain the current behavior.
I've also tried appending "~" or "~1"/"~2" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems worse.
My questions are:
How can I achieve fuzziness for single-word searches, so that the query "triangld" returns documents containing "triangle", etc.?
How can I get correct search results when the typo is in the 2nd (last?) word of the query? As mentioned above it works, but see point 2 above.
Why does just adding fuzziness (see point 1) return more results even when the phrase is spelled correctly?
Is there anything I need to change in my analyzers, etc.?
So, to achieve the desired behavior, we did the following:
changed the query type to "query_string"
added query string preprocessing on the backend: we split the query string on whitespace and append "~1" or "~2" to each word whose length is more than 4 or 8 characters respectively (~ is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user is typing [t, tr, tri, ..., triangle] => no fuzziness, but once they type "triangle " => "triangle~2". This is because fuzziness on the word still being typed produces unexpected results
we also removed all the ngram fields from the search fields, since we get the same results and performance is a bit better
added "default_operator": "AND" to the query to narrow down the results for phrase queries (the combined request is sketched after this list)

Elastic Search Multimatch: Is there a way to search all fields except one?

We have an Elastic Search structure that specifies fields in a multi_match query like this:
"multi_match": {
"query": "find this string",
"fields": ["*_id^20", "*_name^20", "*"]
}
This works great - except under certain circumstances, like when the query is "Find NOWAK". This is because "NOW" is a reserved word for date searching, and the "*" pattern matches fields that are defined as dates.
So what I would like to do is ignore fields that match "*_at".
Is there way to tell Elastic Search to ignore certain fields in a multi_match query?
If the answer to that is "no", then the follow-up question is how to escape the search term so that it won't trigger keywords.
Running version 6.7
Try this:
Exclude a field on an Elasticsearch query
curl -XGET 'localhost:9200/testidx/items/_search?pretty=true' -d '{
    "query" : {
        "query_string": {
            "fields": ["title", "field2", "field3"],   <-- add this
            "query": "Titulo"
        }
    },
    "_source" : {
        "exclude" : ["*.body"]
    }
}'
Apparently the answer is "No: there is not a way to tell ElasticSearch to ignore certain fields in a multi_match query"
For my particular issue I found an inexpensive way to determine the necessary white-listed fields (this is performed outside the scope of ElasticSearch, otherwise I would post it here) and list those in place of the "*" when building the query.
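As an illustration of that workaround, the query ends up with an explicit field list instead of "*" (the names title, description and notes here are hypothetical stand-ins for whatever non-date fields the whitelisting step produces):

"multi_match": {
    "query": "Find NOWAK",
    "fields": ["*_id^20", "*_name^20", "title", "description", "notes"]
}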
I am hopeful someone will tell me I'm wrong, but I don't think I am.

Elasticsearch: what my index contains: docs or positions?

I've created an ES index using the following command:
curl -X PUT -H 'Content-Type: application/json' -H 'Accept: application/json' localhost:9200/docs -d '{
    "settings" : {
        "number_of_shards" : 10,
        "number_of_replicas" : 0,
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase", "porter_stem"],
                    "stopwords" : [...stopwords here ...]
                }
            }
        }
    },
    "mappings" : {
        "html" : {
            "properties" : {
                "head" : {"type" : "text", "analyzer" : "my_analyzer"},
                "body" : {"type" : "text", "analyzer" : "my_analyzer"}
            }
        }
    }
}'
I read here that:
Analyzed string fields use positions as the default, and all other fields use docs as the default.
Since my fields are of text type, are they considered string fields?
My main issue is knowing what my index contains (docs or positions?) for each field. I used the /docs/_settings endpoint to get the index settings, but didn't get a useful answer.
Any hints?
EDIT:
In addition to the answer from ibexit below, I verified this in practice by issuing phrase queries against the ES indices.
You defined the fields as text, without specifying index_options in your mapping. In this case the default for text fields is applied (index_options=positions). The inverted index will then contain doc numbers, term frequencies, and term positions (i.e. order) for the text fields.
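If you want to see or pin this down explicitly rather than rely on the default, you can read the mapping back and set index_options per field when (re)creating the index. A rough sketch against the index from the question (the html type nesting matches the pre-7.x mapping shown above; the analysis settings from the original command are omitted here for brevity):

# Inspect the mapping; index_options only shows up if it was set explicitly,
# otherwise the text-field default (positions) applies:
curl -XGET 'localhost:9200/docs/_mapping?pretty'

# When recreating the index, you can state it explicitly per field, e.g.:
curl -X PUT -H 'Content-Type: application/json' localhost:9200/docs -d '{
    "mappings" : {
        "html" : {
            "properties" : {
                "head" : {"type" : "text", "analyzer" : "my_analyzer", "index_options" : "positions"},
                "body" : {"type" : "text", "analyzer" : "my_analyzer", "index_options" : "docs"}
            }
        }
    }
}'

Setting index_options to docs would index only document numbers for that field (a smaller index, but phrase queries against it would no longer work).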
For more in-depth information about inverted indices, please have a look at https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up or https://youtu.be/x37B_lCi_gc
This should be a good starting point for your research.
Cheers!

How to enable fuzziness for phrase queries in ElasticSearch

We're using ElasticSearch for searching through millions of tags. Our users should be able to include boolean operators (+, -, "xy", AND, OR, brackets). If no hits are returned, we fall back to a spelling suggestion provided by ES and search again. That's our query:
$ curl -XGET 'http://127.0.0.1:9200/my_index/my_type/_search' -d '
{
    "query" : {
        "query_string" : {
            "query" : "some test query +bools -included",
            "default_operator" : "AND"
        }
    },
    "suggest" : {
        "text" : "some test query +bools -included",
        "simple_phrase" : {
            "phrase" : {
                "field" : "my_tags_field",
                "size" : 1
            }
        }
    }
}'
Instead of only providing a fallback to spelling suggestions, we'd like to enable fuzzy matching. If, for example, a user searches for "stackoverfolw", ES should return matches for "stackoverflow".
Additional question: what's the better-performing method for "correcting" spelling errors? As it is now, we have to perform two subsequent requests: first with the original search term, then with the term suggested by ES.
The query_string query does support some fuzziness, but only when using the ~ operator, which I don't think fits your use case. I would add a fuzzy query and combine it with the existing query_string. For instance, you can use a bool query and add the fuzzy query as a should clause, keeping the original query_string as a must clause.
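A rough sketch of that combination, reusing the query and field from the question (the fuzzy term stackoverfolw is just the example misspelling; in practice it would be derived from the user's input):

$ curl -XGET 'http://127.0.0.1:9200/my_index/my_type/_search' -d '
{
    "query" : {
        "bool" : {
            "must" : {
                "query_string" : {
                    "query" : "some test query +bools -included",
                    "default_operator" : "AND"
                }
            },
            "should" : {
                "fuzzy" : {
                    "my_tags_field" : {
                        "value" : "stackoverfolw",
                        "fuzziness" : "AUTO"
                    }
                }
            }
        }
    }
}'

Note that with the query_string kept as a must clause, the fuzzy clause only affects scoring; if fuzzy-only matches should also be returned, the query_string would have to move into should alongside it.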
As for your additional question about how to correct spelling mistakes: I would use fuzzy queries to correct them automatically, and two subsequent requests if you want the user to select the right correction from a list (e.g. "Did you mean"), but your approach sounds good too.

Is there a way to "escape" ElasticSearch stop words?

I am fairly new to ElasticSearch and have a question on stop words. I have an index that contains state names for the USA, e.g. New York/NY, California/CA, Oregon/OR. I believe Oregon's abbreviation, 'OR', is a stop word, so when I insert the state data into the index, I cannot search on 'OR'. Is there a way I can set up custom stop words for this, or am I doing something wrong?
Here is how I am building the index:
curl -XPUT http://localhost:9200/test/state/1 -d '{"stateName": ["California","CA"]}'
curl -XPUT http://localhost:9200/test/state/2 -d '{"stateName": ["New York","NY"]}'
curl -XPUT http://localhost:9200/test/state/3 -d '{"stateName": ["Oregon","OR"]}'
A search for 'NY' works fine. Ex:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
    "query" : {
        "match" : {
            "stateName" : "NY"
        }
    }
}'
But a search for 'OR' returns zero hits:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
    "query" : {
        "match" : {
            "stateName" : "OR"
        }
    }
}'
I believe this search returns no results because OR is a stop word, but I don't know how to work around this. Thanks for your help.
You can (and definitely should) control the way you index data by modifying your mapping according to your data and the way you want to search against it.
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to. The point is that you're using the default mapping which is great to start with, but as you can see you need to tweak it depending on your needs.
For each field, you can specify what analyzer to use. An analyzer defines the way you split your text into tokens (tokenizer) that will be indexed and also additional changes you can make to each token (even remove or add new ones) using token filters.
You can specify your mapping either while creating your index or update it afterwards using the put mapping api (as long as the changes you make are backwards compatible).
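As a rough sketch of the first option (keeping the defaults elsewhere and disabling stop words just for that field), assuming the index is created up front with a custom analyzer; the analyzer name state_analyzer is made up, and on current versions the field type would be text rather than string:

curl -XPUT 'http://localhost:9200/test' -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "state_analyzer" : {
                    "type" : "standard",
                    "stopwords" : "_none_"
                }
            }
        }
    },
    "mappings" : {
        "state" : {
            "properties" : {
                "stateName" : {"type" : "string", "analyzer" : "state_analyzer"}
            }
        }
    }
}'

After recreating the index this way and re-indexing the three state documents, the match query for 'OR' should find Oregon, since the analyzer no longer drops it as a stop word.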
