Elasticsearch: search_as_you_type datatype vs. tokenizer edge_ngram - elasticsearch

What is the difference between the new search_as_you_type datatype in Elasticsearch and the edge_ngram tokenizer? Which one should be preferred when building a search-as-you-type search engine?
The Elasticsearch documentation covers both implementations:
search_as_you_type datatype: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html
tokenizer type edge_ngram: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html (Look at the example of how to set up a field for search-as-you-type.)
UPDATE
Elasticsearch version : 7.6.1
I indexed my data with the search_as_you_type data type according to the latest Elasticsearch documentation, and I am trying to build a simple query via the Java API based on the example below:
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "brown f",
      "type": "bool_prefix",
      "fields": [
        "my_field",
        "my_field._2gram",
        "my_field._3gram"
      ]
    }
  }
}
The point that I struggle with is adding "type": "bool_prefix".
A) I tried with MultiMatchQueryBuilder
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(value, fields);
multiMatchQueryBuilder.type(MatchQuery.Type.BOOLEAN_PREFIX);
and got an exception at the second line of the above code:
org.elasticsearch.ElasticsearchParseException: failed to parse [multi_match] query type [boolean_prefix]. unknown type.
B) Then I tried with MatchBoolPrefixQueryBuilder
MatchBoolPrefixQueryBuilder matchBoolPrefixQueryBuilder = new MatchBoolPrefixQueryBuilder(value, fields);
and got an exception:
org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=parsing_exception, reason=[match_bool_prefix] unknown token [START_ARRAY] after [query]]
...
Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/my_dictionary/_search?pre_filter_shard_size=128&typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"parsing_exception","reason":"[match_bool_prefix] unknown token [START_ARRAY] after [query]","line":1,"col":57}],"type":"parsing_exception","reason":"[match_bool_prefix] unknown token [START_ARRAY] after [query]","line":1,"col":57},"status":400}
at the line
SearchResponse searchResponse=restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
What am I doing wrong? Which one should I use and how?
SOLUTION
I solved the issue just by changing the type to:
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(value, fields);
multiMatchQueryBuilder.type("bool_prefix");
But I don't understand why the type must be hardcoded as "bool_prefix" instead of using MatchQuery.Type.BOOLEAN_PREFIX, or why it is not possible to use MatchBoolPrefixQueryBuilder; there are not many implementation examples of this query.

The two are different things.
edge_ngram is a tokenizer, which means it kicks in at indexing time to tokenize your input data. There is also an edge_ngram token filter. Both are similar but work at different levels. See this thread to learn about the main differences.
search_as_you_type is a field type which contains a few sub-fields, one of which is called _index_prefix and which leverages the edge_ngram tokenizer.
So basically, what you see in the edge_ngram tokenizer documentation was leveraged when they decided to add the new search_as_you_type field type.
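For reference, a minimal mapping is enough to get the sub-fields: declaring a field as search_as_you_type automatically creates my_field._2gram, my_field._3gram, and the internal prefix sub-field. A sketch, assuming placeholder names my_index and my_field:

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "search_as_you_type"
      }
    }
  }
}
```

A multi_match query of type bool_prefix over my_field and its generated sub-fields, as in the question above, then gives the search-as-you-type behavior without any custom edge_ngram analyzer.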
UPDATE
You actually need to use
MultiMatchQueryBuilder multiMatchQueryBuilder = new MultiMatchQueryBuilder(value, fields);
multiMatchQueryBuilder.type(MultiMatchQueryBuilder.Type.BOOL_PREFIX);
You can see here how that enumeration value is built.

Related

Elasticsearch query filter/term not working when special characters are involved

The following query does not work when "metadata.name" contains "-" in the text, e.g. "demo-application-child3". But if I remove the "-" and change the query to "demoapplicationchild3", it works. The same happens with the other field, metadata.version. I have data for both demoapplicationchild3 and demo-application-child3. Suggestions, please.
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "metadata.name": "demo-application-child3" } },
        { "term": { "metadata.version": "00.00.100" } }
      ]
    }
  }
}
term queries are not analyzed; see the official doc, which clearly mentions this:
Returns documents that contain an exact term in a provided field.
This means that at index time you are using some custom analyzer that removes - and joins the tokens, i.e. for demo-application-child3 your custom analyzer would generate the token demoapplicationchild3, which you can easily confirm using the Analyze API.
If you want to get results, either change the term query to a match query, or use the .keyword suffix with your field if the mapping is generated dynamically, or create another field of type keyword, which uses a no-op analyzer.
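Both checks can be sketched as follows, assuming an index named my_index and dynamically generated .keyword sub-fields (names are placeholders; adjust them to your mapping):

```json
GET my_index/_analyze
{
  "field": "metadata.name",
  "text": "demo-application-child3"
}

GET my_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "metadata.name.keyword": "demo-application-child3" } },
        { "term": { "metadata.version.keyword": "00.00.100" } }
      ]
    }
  }
}
```

The first request shows the tokens actually stored for the field; the second filters on the un-analyzed keyword sub-fields, so the hyphenated value matches exactly.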

What is the best way to map following unstructured data in elastic search?

I am trying to figure out what the best type and analyzer could be for a field which has unstructured data.
The request field could be any of the following, among many others:
{"_format":"json","follow":{"followee":27}} //nested objects
[{"q": "madhab"}] //array of objects
?q=madhab //string
I have tried making this field text with the simple analyzer:
"request": {
"type": "text",
"analyzer": "simple"
},
Plus: I wonder if there is any online tool which can help visualize how Elasticsearch tokenizes the data with given analyzers and filters.
Elasticsearch gives you the option to see how text is tokenized by various analyzers. You can use Kibana or any REST client to see the response for such a request:
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
https://www.elastic.co/guide/en/elasticsearch/guide/master/analysis-intro.html
This will give you a fair idea of what is missing in your schema with respect to your queries.

Achieve functionality like snowball analyzer through normalizer on keyword type - ELASTICSEARCH 5.6

I have been trying to implement snowball-analyzer-like functionality on one of my doc fields, which is of type keyword. For example, plurals should be treated exactly like their singulars so that results are the same for both.
Initially, I struggled to set an analyzer on my field, only to discover that fields of type keyword cannot have analyzers, only normalizers. So I tried setting a normalizer for snowball on those fields, but it seems my normalizer does not allow the snowball filter (maybe normalizers don't support the snowball filter).
I can't change the type of the field. I want behavior such that if my input text matches restaurants, it is treated the same as restaurant and gives the same results, so that I don't have to add restaurants as a keyword to that field.
Can we achieve this through normalizers? I have gone through the Elastic documentation and various posts but got no clue. Below is how I tried setting the normalizer, with the response from my Elasticsearch server.
PUT localhost:9200/db110/_settings
{
  "analysis": {
    "normalizer": {
      "snowball_normalizer": {
        "filter": [ "lowercase", "snowball" ]
      }
    },
    "filter": {
      "snow": {
        "type": "snowball",
        "language": "English"
      }
    }
  }
}
Response
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Custom normalizer [snowball_normalizer] may not use filter [snowball]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Custom normalizer [snowball_normalizer] may not use filter [snowball]"
  },
  "status": 400
}
You can't do that! Snowball is a stemmer and is used for full-text search, e.g. the text datatype, because it is a token filter that manipulates every single token. With the keyword datatype you create a single token for the entire content of the field. How could a stemmer work on a keyword field? Using a stemmer without tokens makes no sense. Normalizers for keyword fields only support filters such as lowercase and asciifolding. Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/normalizer.html
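If recreating the index is an option, the usual way to get stemmed matching next to exact matching is a text sub-field with a snowball analyzer. A hedged sketch for ES 5.6 (where properties sit under a mapping type, here my_type, a placeholder, as is my_field), assuming the index can be recreated and reindexed:

```json
PUT db110
{
  "settings": {
    "analysis": {
      "analyzer": {
        "snowball_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "snowball" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "keyword",
          "fields": {
            "stemmed": {
              "type": "text",
              "analyzer": "snowball_analyzer"
            }
          }
        }
      }
    }
  }
}
```

A match query on my_field.stemmed would then treat restaurants and restaurant the same, while term queries on my_field keep their exact-match behavior.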

Elasticsearch: Use match query along with autocomplete

I want to use a match query along with autocomplete suggestions in ES5. Basically, I want to restrict my autocomplete results based on an attribute, e.g. autocomplete should return results within a single city only.
MatchQueryBuilder queryBuilder = QueryBuilders.matchQuery("cityName", city);
SuggestBuilder suggestBuilder = new SuggestBuilder()
    .addSuggestion("region", SuggestBuilders.completionSuggestion("region").text(text));
SearchResponse response = client.prepareSearch(index).setTypes(type)
    .suggest(suggestBuilder)
    .setQuery(queryBuilder)
    .execute()
    .actionGet();
The above doesn't seem to work correctly: I am getting both sets of results in the response, independent of each other.
Any suggestion?
It looks like the suggestion builder is creating a completion suggester. Completion suggesters are stored in a specialized structure that is separate from the main index, which means they have no access to your filter fields like cityName. To filter suggestions, you need to explicitly define those same filter values when you create the suggestion, separate from the attributes you are indexing for the document to which the suggestion is attached. These suggester filters are called contexts. More information can be found in the docs.
The docs linked to above are going to explain this better than I can, but here is a short example. Using a mapping like the following:
"auto_suggest": {
"type": "completion",
"analyzer": "simple",
"contexts": [
{
"name": "cityName",
"type": "category",
"path": "cityName"
}
]
}
This section of the index mapping defines a completion suggester called auto_suggest with a cityName context that can be used to filter the suggestions. Note that the path value is set, which means this context filter gets its value from the cityName attribute in your main index. You can remove the path value if you want to explicitly set the context to something that isn't already in the main index.
To request suggestions while providing context, something like this in combination with the settings above should work:
"suggest": {
"auto_complete":{
"text":"Silv",
"completion": {
"field" : "auto_suggest",
"size": 10,
"fuzzy" : {
"fuzziness" : 2
},
"contexts": {
"cityName": [ "Los Angeles" ]
}
}
}
}
Note that this request also allows for fuzziness, to make it a little resilient to spelling mistakes. It also restricts the number of suggestions returned to 10.
It's also worth noting that in ES 5.x completion suggesters are document-centric, so if multiple documents have the same suggestion, you will receive duplicates of that suggestion if it matches the characters entered. There's an option in ES 6 to de-duplicate suggestions, but nothing similar in 5.x. Again, it's best to think of completion suggesters as existing in their own index, specifically an FST, which is explained in more detail here.

In Elasticsearch match query how to deal with slash

I have a match query searching for a type of doc:
{
  "query": {
    "bool": {
      "should": {
        "match": {
          "ph1_enc": "EAAQnb1kMr/e2/ADqo"
        }
      }
    }
  }
}
"EAAQnb1kMr/e2/ADqo" is the string i'm trying to match, however in the search results I can see multiple records with substring "/e2/" are also returned.
Looks like "/e2/" is indexed separately, so that this could happen.I thought the match query is to do full-text match... Is it because I missed something when creating the template? Any idea?
Add-on instead of reindex, how to modify the query to match the exact value in the query?
Which analyzer did you set in the mapping to index your data?
If you are using the default one (the standard analyzer), then according to the documentation it uses the default tokenizer, which also splits text on the slash ('/'). The documentation redirects here for more information about the tokenizer.
So, that will index the words 'EAAQnb1kMr', 'e2', and 'ADqo'. Accordingly, your query value will also be analyzed the same way the field was indexed. That is why documents with 'e2' are also being returned.
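You can verify the tokenization with the Analyze API; assuming the default standard analyzer is in effect, something like:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "EAAQnb1kMr/e2/ADqo"
}
```

This should return the tokens eaaqnb1kmr, e2, and adqo (lowercased by the standard analyzer), confirming that the slash acts as a token separator.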
If you don't need to tokenize the 'ph1_enc' field, you can just set its type in the mapping to 'keyword':
"properties": {
"ph1_enc": {
"type": "keyword"
}
}
That way the field will not be analyzed, and it will match exactly when you query it.
I hope that helps.
