Spring Data Elasticsearch wildcard search - elasticsearch

I am trying to search for the word "blue" in the below list of text:
"BlueSaphire","Bluo","alue","blue", "BLUE",
"Blue","Blue Black","Bluo","Saphire Blue",
"black" , "green","bloo" , "Saphireblue"
SearchQuery searchQuery = new NativeSearchQueryBuilder().withIndices("color")
.withQuery(matchQuery("colorDescriptionCode", "blue")
.fuzziness(Fuzziness.ONE)
)
.build();
This works fine and the search result returns the below records along with the scores
alue 2.8718023
Bluo 1.7804208
Bluo 1.7804208
BLUE 1.2270637
blue 1.2270637
Blue 1.2270637
Blue Black 1.1082436
Saphire Blue 0.7669148
But I am not able to make the wildcard work. "SaphireBlue" and "BlueSaphire" are also expected to be part of the result.
I tried the below setting, but it does not work.
SearchQuery searchQuery = new NativeSearchQueryBuilder().withIndices("color")
.withQuery(matchQuery("colorDescriptionCode", "(.*?)blue")
.fuzziness(Fuzziness.ONE)
)
.build();
On Stack Overflow, I came across a solution that suggests specifying analyzeWildcard:
QueryBuilder queryBuilder = boolQuery().should(
        queryString("blue").analyzeWildcard(true)
                .field("colorDescriptionCode", 2.0f));
I cannot find the queryString static method. I am using spring-data-elasticsearch 2.0.0.RELEASE.
Let me know how I can specify the wildcard so that all words containing "blue" are also returned in the search results.

I know that working examples are always better than theory, but still, I would first like to cover a little theory. The heart of Elasticsearch is Lucene, so before a document is written to the Lucene index, it goes through an analysis stage. The analysis stage can be divided into 3 parts:
char filtering;
tokenizing;
token filtering
In the first stage, we can throw away unwanted characters, for example HTML tags. More information about character filters can be found on the official site.
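If you want to see a char filter in action, you can experiment with the _analyze API on a recent Elasticsearch version; a minimal sketch (the html_strip char filter is built in, the sample text is just illustrative):
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<b>Blue</b> Saphire"
}
This should return the tokens "Blue" and "Saphire", with the HTML tags stripped before tokenizing.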
The next stage is far more interesting. Here we split the input text into tokens, which will be used later for searching. A few very useful tokenizers:
standard tokenizer. It is used by default. The tokenizer implements the Unicode Text Segmentation algorithm. In practice, you can use it to split text into words and use these words as tokens.
n-gram tokenizer. This is what you need if you want to search by part of a word. This tokenizer splits text into contiguous sequences of n characters. For example, the text "for example" will be split into tokens such as "fo", "or", "r ", " e", "ex", "for", "or ex", etc. The length of the n-gram is variable and can be configured with the min_gram and max_gram params.
edge n-gram tokenizer. Works the same as the n-gram tokenizer except for one thing: this tokenizer doesn't increment the offset. For example, the text "for example" will be split into tokens such as "fo", "for", "for ", "for e", "for ex", "for exa", etc.
More information about tokenizers can be found on the official site. (Unfortunately, I can't post more links because of low reputation.)
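On recent Elasticsearch versions you can see the tokens a tokenizer produces directly with the _analyze API; here is a sketch with an inline edge_ngram tokenizer (the min_gram/max_gram values are illustrative only):
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 7
  },
  "text": "for example"
}
This should produce tokens along the lines of "fo", "for", "for ", "for e", "for ex", "for exa", matching the example above.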
The next stage is also damn interesting. After we split the text into tokens, we can do a lot of interesting things with them. Again, a few very useful token filters:
lowercase filter. In most cases we want case-insensitive search, so it's good practice to bring tokens to lowercase.
stemmer filter. When we deal with natural language, we have a lot of problems. One of them is that a word can have many forms. The stemmer filter helps us get the root form of a word (see the sketch after this list).
fuzziness. Another problem is that users often make typos. Note that fuzzy matching is usually applied at query time via the fuzziness option (there is no standard fuzziness token filter), so matches can tolerate a small number of typos.
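As a quick illustration of token filters, here is a sketch via _analyze on a recent Elasticsearch version (the inline stemmer definition and the sample text are for demonstration only):
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "english" }
  ],
  "text": "Replacing Records"
}
This should return roughly the tokens "replac" and "record": lowercased and reduced to their root forms.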
If you are interested in looking at the analyzed result for a stored document, you can use the _termvectors endpoint:
curl [ELASTIC_URL]:9200/[INDEX_NAME]/[TYPE_NAME]/[DOCUMENT_ID]/_termvectors?pretty
Now let's talk about queries. Queries are divided into 2 large groups, with 2 significant differences:
Whether the request goes through the analysis stage or not;
Whether the request requires an exact (yes or no) answer.
Examples are the match query and the term query. The first goes through the analysis stage, the second does not. The first does not give us a yes/no answer (it gives us a score instead), the second does. When creating mappings for a document, we can specify both the index analyzer and the search analyzer separately per field.
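To make the difference concrete, here is a sketch of both query types against the colorDescriptionCode field from the question (index and field names taken from above; assuming the field is analyzed with a lowercase filter):
POST color/_search
{
  "query": { "match": { "colorDescriptionCode": "Blue" } }
}

POST color/_search
{
  "query": { "term": { "colorDescriptionCode": "Blue" } }
}
The match query analyzes "Blue" (so it becomes the token "blue") and returns scored results; the term query looks for the exact token "Blue" in the index and, against such an analyzed field, will typically find nothing.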
Now some information regarding Spring Data Elasticsearch. Here it makes sense to talk about concrete examples. Suppose we have a document with a title field and we want to search on this field. First, create a file with settings for Elasticsearch.
{
  "analysis": {
    "analyzer": {
      "ngram_analyzer": {
        "tokenizer": "ngram_tokenizer",
        "filter": [
          "lowercase"
        ]
      },
      "edge_ngram_analyzer": {
        "tokenizer": "edge_ngram_tokenizer",
        "filter": [
          "lowercase"
        ]
      },
      "english_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "english_stop",
          "unique",
          "english_possessive_stemmer",
          "english_stemmer"
        ]
      },
      "keyword_analyzer": {
        "tokenizer": "keyword",
        "filter": ["lowercase"]
      }
    },
    "tokenizer": {
      "ngram_tokenizer": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 20
      },
      "edge_ngram_tokenizer": {
        "type": "edge_ngram",
        "min_gram": 2,
        "max_gram": 20
      }
    },
    "filter": {
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "english_stemmer": {
        "type": "stemmer",
        "language": "english"
      },
      "english_possessive_stemmer": {
        "type": "stemmer",
        "language": "possessive_english"
      }
    }
  }
}
You can save these settings in your resources folder. Now let's look at our document class:
@Document(indexName = "document", type = "document")
@Setting(settingPath = "document_index_setting.json")
public class Document {

    @Id
    private String id;

    @MultiField(
        mainField = @Field(type = FieldType.String,
                           index = FieldIndex.not_analyzed),
        otherFields = {
            @InnerField(suffix = "edge_ngram",
                        type = FieldType.String,
                        indexAnalyzer = "edge_ngram_analyzer",
                        searchAnalyzer = "keyword_analyzer"),
            @InnerField(suffix = "ngram",
                        type = FieldType.String,
                        indexAnalyzer = "ngram_analyzer",
                        searchAnalyzer = "keyword_analyzer"),
            @InnerField(suffix = "english",
                        type = FieldType.String,
                        indexAnalyzer = "english_analyzer")
        }
    )
    private String title;

    // getters and setters omitted
}
So here we have the field title with three inner fields:
title.edge_ngram for searching by edge n-grams with the keyword search analyzer. We need this because we don't want our query itself to be split into edge n-grams;
title.ngram for searching by n-grams;
title.english for searching with the nuances of natural language;
and the main field title, which we don't analyze because sometimes we want to sort by it.
Let's use a simple multi match query to search across all these fields:
String searchQuery = "blablabla";
MultiMatchQueryBuilder queryBuilder = multiMatchQuery(searchQuery)
        .field("title.edge_ngram", 2)
        .field("title.ngram")
        .field("title.english");
NativeSearchQueryBuilder searchBuilder = new NativeSearchQueryBuilder()
        .withIndices("document")
        .withTypes("document")
        .withQuery(queryBuilder)
        .withPageable(new PageRequest(page, pageSize));
elasticsearchTemplate.queryForPage(searchBuilder.build(),
        Document.class,
        new SearchResultMapper() {
            // implementation omitted
        });
Search is a very interesting and voluminous topic. I tried to answer as briefly as possible, so it's possible some parts are confusing; don't hesitate to ask.

I could not achieve fuzziness and wildcard search in one query.
This is the closest solution I could get. I had to fire two different queries and merge the results manually.
@Query("{\"wildcard\" : {\"colorDescriptionCode\" : \"?0\" }}")
Page<ColorDescription> findByWildCard(String colorDescriptionCode, Pageable pageable);

@Query("{\"match\": { \"colorDescriptionCode\": { \"query\": \"?0\", \"fuzziness\": 1 }}}")
Page<ColorDescription> findByFuzzy(String colorDescriptionCode, Pageable pageable);
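For reference, with the ?0 placeholder substituted (e.g. "*blue*" for the wildcard method and "blue" for the fuzzy one), the two repository methods send request bodies equivalent to this sketch:
POST color/_search
{
  "query": { "wildcard": { "colorDescriptionCode": "*blue*" } }
}

POST color/_search
{
  "query": { "match": { "colorDescriptionCode": { "query": "blue", "fuzziness": 1 } } }
}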

Related

Position aware search results in Elasticsearch autocompletion

I want to implement address autocompletion using Elasticsearch.
The current approach I am investigating is based on search_as_you_type field type.
Consider these two addresses:
3543JN Carl Zellerhof 8 Utrecht (3543JN is postcode)
1234JN The Street 3543 Utrecht
It is important to prioritize some address parts over others; for instance, the postcode should have more weight than the house number, e.g. when a user types 3543, the first address should come first in the search results.
I see two solutions here:
Combine the address into one string and give weight based on position within the combined string
Search on multiple fields (then the weight can be adjusted per field, but it seems more complex to me; how do I ensure the same address part is not matched several times?)
I am leaning more towards the one-string solution, but this implementation gives both addresses the same weight for the 3543 search query.
Please advise how to implement this.
(It is also desirable to allow some fuzziness)
UPD:
It seems that adding a postcode field to the multi_match fields gives me what I want. Are there any disadvantages to this approach?
The index:
{
  "mappings": {
    "properties": {
      "search": {
        "type": "search_as_you_type"
      }
    }
  }
}
The search query:
{
  "query": {
    "multi_match": {
      "query": "3543",
      "type": "bool_prefix",
      "fields": [
        "search",
        "search._2gram",
        "search._3gram"
      ]
    }
  }
}
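For completeness, the variant mentioned in the update might look something like the sketch below; it assumes the postcode is also indexed in a separate postcode field (not shown in the mapping above), and the ^3 boost value is arbitrary:
{
  "query": {
    "multi_match": {
      "query": "3543",
      "type": "bool_prefix",
      "fields": [
        "postcode^3",
        "search",
        "search._2gram",
        "search._3gram"
      ]
    }
  }
}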

Elasticsearch Suggest+Synonyms+fuzziness

I am looking for a way to implement auto-suggest with synonyms & fuzziness.
For example, suppose the user tries to search for "replce ar".
My synonym list has ar => audio record
So, the result should include the items matching
changing audio record
replacing audio record
etc..,
Here we need fuzziness because there is a typo in "replace" (in the user's search text)
Synonyms to match ar => audio record
Auto-suggest with regex pattern.
Is it possible to implement all the three features in a single field?
Edit:
A regex+fuzzy combination just throws an error.
I haven't explained my need for a regex pattern well.
I needed a regex for doing a partial word lookup ('encyclopedic' contains 'cyclo').
Now, after investigating the options I have for this purpose, which pointed me to the NGram Tokenizer, and after looking into the other suggesters, I found that the Phrase suggester may really be what I'm looking for, so I'll try it and report back.
Yes, you can use synonyms as well as fuzziness for suggestions. The synonyms are handled by defining a synonym token filter and adding that filter to your language analyzer. Then, when you create the field mapping for the field(s) you want to use for suggestions, you assign that analyzer to that field.
As for fuzziness, that happens at query time. Most text-based queries support a fuzziness option which allows you to specify how many corrections you want to allow. The default auto value adjusts the number of corrections, depending on how long the term is, so that's usually best.
Notional analysis setup (synonym_graph reference)
{
  "analysis": {
    "filter": {
      "synonyms": {
        "type": "synonym_graph",
        "expand": "false",
        "synonyms": [
          "ar => audio record"
        ]
      }
    },
    "analyzer": {
      "synonyms": {
        "tokenizer": "standard",
        "type": "custom",
        "filter": [
          "standard",
          "lowercase",
          "synonyms"
        ]
      }
    }
  }
}
Notional Field Mapping (Analyzer + Mapping reference)
(Note that the analyzer matches the name of the analyzer defined above)
{
  "properties": {
    "suggestion": {
      "type": "text",
      "analyzer": "synonyms"
    }
  }
}
Notional Query
{
  "query": {
    "match": {
      "suggestion": {
        "query": "replce ar",
        "fuzziness": "auto",
        "operator": "and"
      }
    }
  }
}
Keep in mind that there are several different options for suggestions, so depending on which option you use, you may need to adjust the way the field is mapped, or even add another token filter to the analyzer. But an analyzer is essentially just a tokenizer plus a series of token filters, so you can usually combine whatever token filters you need to achieve your goal. Just make sure you understand what each filter is doing so you get the filters in the correct order.
If you get stuck in part of this process, just submit another question with the specific issue you're running into. Good luck!

In Elasticsearch, how do I search for an arbitrary substring?

In Elasticsearch, how do I search for an arbitrary substring, perhaps including spaces? (Searching for part of a word isn't quite enough; I want to search any substring of an entire field.)
I imagine it has to be in a keyword field, rather than a text field.
Suppose I have only a few thousand documents in my Elasticsearch index, and I try:
"query": {
"wildcard" : { "description" : "*plan*" }
}
That works as expected: I get every item where "plan" is in the description, even ones like "supplantation".
Now, I'd like to do
"query": {
"wildcard" : { "description" : "*plan is*" }
}
...so that I might match documents containing "Kaplan isn't", among many other possibilities.
It seems this isn't possible with wildcard, match prefix, or any other query type I might see. How do I simply search on any substring? (In SQL, I would just do description LIKE '%plan is%')
(I am aware any such query would be slow or perhaps even impossible for large data sets.)
Have you tried the regexp query in Elasticsearch? It sure does sound like something you might be interested in.
I was hoping there might be something built into Elasticsearch for this, given that such a simple substring search seems like a very basic capability (thinking about it, it is implemented as strstr() in C, LIKE '%%' in SQL, Ctrl+F in most text editors, String.IndexOf in C#, etc.), but this seems not to be the case. Note that the regexp query doesn't support case insensitivity, so I also needed to pair it with this custom analyzer, so that the index is all-lowercase. Then I can convert my search string to lowercase as well.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    ...
    "description": { "type": "text", "analyzer": "lowercase_keyword" }
  }
}
Example query:
"query": {
"regexp" : { "description" : ".*plan is.*" }
}
Thanks to Jai Sharma for leading me; I just wanted to provide more detail.

How to match parts of a word in elasticsearch?

How can I match parts of a word to the parent word? For example, I need to match "eese" or "heese" to the word "cheese".
The best way to achieve this is using an edgeNGram token filter combined with two reverse token filters. So, first, you need to define a custom analyzer called reverse_analyzer in your index settings as shown below. Then you can see that I've declared a string field called your_field with a sub-field called suffix which uses our custom analyzer.
PUT your_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "your_field": {
          "type": "string",
          "fields": {
            "suffix": {
              "type": "string",
              "analyzer": "reverse_analyzer"
            }
          }
        }
      }
    }
  }
}
Then you can index a test document with "cheese" inside, like this:
PUT your_index/your_type/1
{"your_field": "cheese"}
When this document is indexed, the your_field.suffix field will contain the following tokens:
e
se
ese
eese
heese
cheese
Under the hood what is happening when indexing cheese is the following:
The keyword tokenizer will emit the whole input as a single token => cheese
The lowercase token filter will put the token in lowercase => cheese
The reverse token filter will reverse the token => eseehc
The substring token filter will produce different tokens of length 1 to 10 => e, es, ese, esee, eseeh, eseehc
Finally, the second reverse token filter will reverse again all tokens => e, se, ese, eese, heese, cheese
Those are all the tokens that will be indexed
So we can finally search for eese (or any suffix of cheese) in that sub-field and find our match
POST your_index/_search
{
  "query": {
    "match": {
      "your_field.suffix": "eese"
    }
  }
}
=> Yields the document we've just indexed above.
You can do it two ways:
If you need it to happen only for some searches, then in the search box you can pass
*eese* or *heese*
i.e. just add * at the beginning and end of your search word. If you need it for every search:
string "*#{params[:query]}*"
This will match against your parent word and give the result.
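In query DSL form that would be something like this sketch (reusing the your_field name from the earlier answer; keep in mind that leading wildcards can be slow on large indices):
POST your_index/_search
{
  "query": {
    "wildcard": { "your_field": "*eese*" }
  }
}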
There are multiple ways to do this:
The analyzer approach - here you use an n-gram tokenizer to generate sub-tokens of all the words. For the word "cheese" this produces tokens such as ["chee", "hees", "eese", "cheese"] and all kinds of other substrings (see the _analyze sketch after this list). With this, the index size goes up, but the search speed is optimized.
The wildcard query approach - in this approach, a scan happens over the inverted index. This does not take additional index space, but it takes more time at search time.
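A sketch of the n-gram token generation for "cheese" via _analyze on a recent Elasticsearch version (the inline tokenizer definition and the min_gram/max_gram values are illustrative only):
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 4,
    "max_gram": 6
  },
  "text": "cheese"
}
This should return the substrings "chee", "chees", "cheese", "hees", "heese" and "eese".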

problems with phrase matching in elasticsearch

I'm trying to perform Phrase matching using elasticsearch.
Here is what I'm trying to accomplish:
data - 1: { "test": { "title": "text1 text2" } }
2: { "test": { "title": "text3 text4" } }
3: { "test": { "title": "text5" } }
4: { "test": { "title": "text6" } }
Search terms:
If I look up "text0 text1 text2 text3", it should return #1 (matches the full string)
If I look up "text6 text5 text4 text3", it should return #4 and #3, but not #2 as it's not in the same order.
Here is what I've tried:
set the index_analyzer to keyword, and the search_analyzer to standard
also tried creating custom tokens
but none of my solutions allows me to look up a substring match from the search query against the keyword in the document.
If anyone has written similar queries, can you share how the mappings are configured and what kind of query is used?
What I see here is this: you want your search to match on any token sent in the query, and if a token does match, it must be an exact match to the title.
This means that indexing your title field as keyword will get you that mandatory exact match. However, the standard analyzer at search time would never match titles containing spaces, because your index holds the single token "text1 text2" while your search produces the tokens "text1" and "text2". You can't use a phrase match with any slop value either, or your token-order requirement will be ignored.
So, what you really need is to generate keyword tokens at index time, but generate shingles whenever you search. The shingles maintain order, and if one of them matches, consider it a go. I would configure the filter to not output unigrams, but to allow unigrams if no shingles can be built. This means that if you have just one word, it will emit that token, but if it can combine your search words into shingled tokens of various lengths, it will not emit single-word tokens.
PUT
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "max_shingle_size": 50,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "my_shingler": {
          "filter": [
            "lowercase",
            "asciifolding",
            "my_shingle"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
Then you just want to set your type mapping to use the keyword analyzer at index time and the my_shingler analyzer at search time, for example:
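A notional mapping sketch for that, using bracketed placeholders as in the curl example earlier (depending on your Elasticsearch version the index-time key is analyzer or index_analyzer, and the field type may be string or text):
PUT [INDEX_NAME]/_mapping/[TYPE_NAME]
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "keyword",
      "search_analyzer": "my_shingler"
    }
  }
}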
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html
