How to query for phrases (shingles) in Elasticsearch

I have the following string "Word1 Word2 StopWord1 StopWord2 Word3 Word4".
When I query for this string using ["bool"]["must"]["match"], I would like to return all text that matches "Word1Word2" and/or "Word3Word4".
I have created an analyzer that I would like to use for indexing and searching.
Using the _analyze API, I have confirmed that indexing is being done correctly; the shingles returned are "Word1Word2" and "Word3Word4".
I want to query so that text matching "Word1Word2" and/or "Word3Word4" is returned. How can I do this dynamically? I don't know up front how many shingles will be generated, so I don't know how many match_phrase clauses to code up in a query:
"should":[
{ "match_phrase" : {"content": phrases[0]}},
{ "match_phrase" : {"content": phrases[1]}}
]

To query for shingles (and unigrams), you could set up your mappings to handle them cleanly in separate fields. In the example below, the subfield "shingles" will be used to analyze and retrieve shingles, while the main title field will handle unigrams. Because the same shingle analyzer is applied to the query string at search time, you don't need to know in advance how many shingles will be generated; a single match query on the shingles field covers all of them.
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}
PUT /my_index/_mapping/my_type
{
  "my_type": {
    "properties": {
      "title": {
        "type": "string",
        "fields": {
          "shingles": {
            "type": "string",
            "analyzer": "my_shingle_analyzer"
          }
        }
      }
    }
  }
}
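You can sanity-check what the shingle analyzer emits by running the example string through the _analyze API (with this configuration the output should be lowercased two-word shingles such as word1 word2, ..., word3 word4):
POST /my_index/_analyze
{
  "analyzer": "my_shingle_analyzer",
  "text": "Word1 Word2 StopWord1 StopWord2 Word3 Word4"
}
The search itself then queries both fields: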
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": "<your query string>"
        }
      },
      "should": {
        "match": {
          "title.shingles": "<your query string>"
        }
      }
    }
  }
}
Ref. Elasticsearch: The Definitive Guide....

Related

Could I combine wildcard and fulltext search in Elasticsearch?

For example, I have some title data in Elasticsearch like this:
gamexxx_nightmare,
gamexxx_little_guy
Then I input:
game => should find gamexxx_nightmare and gamexxx_little_guy
little guy => should find gamexxx_little_guy
First I think I will use a wildcard so that game matches gamexxx; the second is a full-text search?
How can I combine them in one DSL query?
Jaspreet's answer is right, but it doesn't combine both requirements in one query DSL, as the OP asked in the question (How to combine them in one DSL?).
This is an enhancement of Jaspreet's solution: I am also not using the wildcard, and I even avoid the n-gram analyzer, which is costly (it increases the index size) and requires re-indexing if the requirement changes.
One search query combining both requirements can be written as below:
Index mapping
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_underscore"   --> note this
          ]
        }
      },
      "char_filter": {
        "replace_underscore": {
          "type": "mapping",
          "mappings": [
            "_ => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
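To see the effect of the char_filter, you can run a title through the analyzer with the _analyze API (using the index name visible in the result below); the underscore should be replaced by a space, and the standard tokenizer should then produce the tokens gamexxx, little, guy:
POST so-46873023/_analyze
{
  "analyzer": "my_analyzer",
  "text": "gamexxx_little_guy"
}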
Index your sample docs
{
  "title": "gamexxx_little_guy"
}
And
{
  "title": "gamexxx_nightmare"
}
Single Search query
{
  "query": {
    "bool": {
      "must": [   --> note this
        {
          "bool": {
            "must": [
              {
                "prefix": {
                  "title": {
                    "value": "game"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "title": {
                    "query": "little guy"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Result
{
  "_index": "so-46873023",
  "_type": "_doc",
  "_id": "2",
  "_score": 2.2814486,
  "_source": {
    "title": "gamexxx_little_guy"
  }
}
Important points:
The first part of the query is a prefix query, which matches game in both documents (this avoids a costly regex).
The second part allows the full-text search; to enable it, I used a custom analyzer that replaces _ with whitespace, so you don't need expensive n-grams in the index and a simple match query fetches the results.
The query above returns results matching both criteria; you can change the top-level bool clause from must to should if you want to return results matching either criterion.
NGrams have better performance than wildcards. For a wildcard query, all documents have to be scanned to see which ones match the pattern. Ngrams break text into small tokens; e.g. Quick Foxes will be stored as [Qui, uic, ick, Fox, oxe, xes] depending on the min_gram and max_gram sizes.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
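To verify the trigrams, you can run a sample through the analyzer; with min_gram and max_gram of 3, and spaces excluded by token_chars, little guy should produce tokens like lit, itt, ttl, tle, guy:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "little guy"
}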
Query
GET my_index/_search
{
  "query": {
    "match": {
      "text": "little guy"
    }
  }
}
If you want to go with the wildcard only, then you can search on a not_analyzed (keyword) field. This will handle spaces between words:
"wildcard": {
"text.keyword": {
"value": "*gamexxx*"
}
}

Not able to search in compound query using analyzer

I have a problem index which has multiple fields, e.g. tags (a comma-separated string of tags), author, tester. I am creating a global search where problems can be searched by all these fields at once.
I am using a boolean query, e.g.:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "author_username"
          }
        },
        {
          "match": {
            "tester": "tester_username"
          }
        },
        {
          "match": {
            "tags": "<tag1,tag2>"
          }
        }
      ]
    }
  }
}
Without an analyzer I am able to get results, but it uses space as the separator, e.g. python 3 is searched as python or 3.
But I want python 3 to be treated as a single term. So I have created an analyzer for tags so that every comma-separated tag is considered one token, rather than being split on whitespace.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
But now I am not getting any results. Please let me know what I am missing here. I am not able to find the use of analyzer in compound queries in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html
Adding an example:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "test1"
          }
        },
        {
          "match": {
            "tester": "test2"
          }
        },
        {
          "match": {
            "tags": "test3, abc 4"
          }
        }
      ]
    }
  }
}
Results should match all the fields, but for the tags field there should be a union of tags, and the query should be split on commas, not on spaces; i.e. the query should match test3 and abc 4, but the query above searches for test3, abc, and 4 separately.
You need to either remove search_analyzer from your mapping or pass my_analyzer in the match query:
GET tags/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "tags": {
              "query": "python 3",
              "analyzer": "my_analyzer"   --> by default search analyzer is used
            }
          }
        }
      ]
    }
  }
}
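You can also check what my_analyzer produces at index time. Note that splitting only on commas keeps the surrounding whitespace in the tokens, so test3, abc 4 is indexed roughly as test3 and abc 4 (the latter with a leading space), which is why the standard search analyzer's tokens (test3, abc, 4) fail to match the multi-word tag:
POST tags/_analyze
{
  "analyzer": "my_analyzer",
  "text": "test3, abc 4"
}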
By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting.
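Applied to the compound query from the question, only the tags clause changes (a sketch, reusing the example values from above):
GET tags/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "author": "test1"
          }
        },
        {
          "match": {
            "tester": "test2"
          }
        },
        {
          "match": {
            "tags": {
              "query": "test3, abc 4",
              "analyzer": "my_analyzer"
            }
          }
        }
      ]
    }
  }
}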

Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Suppose there is the following mapping with Edge NGram Tokenizer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
And the following documents are indexed:
POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}
Then searching
{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}
yields all of them with the same score, or TRENDING HI with a higher score than the others.
How can it be configured to give a higher score to the entries that actually start with the searched n-gram? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI; at the same time, TRENDING HI should still be in the results.
A highlighter is also used, so the given solution shouldn't mess it up.
The highlighter used in the query is:
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {}
}
}
Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.
You must understand how Elasticsearch/Lucene analyzes your data and calculates the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html this will show you what Elasticsearch will store, in your case:
T / TR / TRE / ... / TRENDING / H / HI
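You can reproduce this token list with the _analyze API against the mapping above:
POST /tag/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "TRENDING HI"
}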
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build complex queries for particular use cases. Use must to filter documents, then should to score. A common use case is to use different analyzers on the same field (using multi-fields in the mapping, you can analyze the same field in different ways).
3. Don't mess up the highlight
According to the doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
You can add an extra highlight query:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}
In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term in the query text:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}
The match clause will match all three documents, but the match_phrase_prefix won't match TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
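For example, to make it explicit that at least one clause has to match (a sketch; with no must clause alongside the should clauses, this is already the default behavior):
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}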
A possible solution for this problem is to use multi-fields. They allow indexing the same data from your source document in different ways. In your case you could index the name field as default text, as ngrams, and also as edge ngrams. Then the query would have to be a bool query comparing all those different fields.
The final score of a document is composed of the match value of each clause. Those matches are also called signals, signalling that there is a match between the query and the document. The document with the most matching signals gets the highest score.
In your case all documents would match the ngram HI. But only the HITS FIND SOME and HITS OTHER documents would get the additional edge-ngram score. This gives those two documents a boost and puts them on top. The complication is that you have to make sure the edge ngram doesn't split on whitespace, because otherwise HI at the end of a document would score the same as at the beginning.
Here is an example mapping and query for your case:
PUT /tag/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_analyzer": {
          "tokenizer": "edge_tokenizer"
        },
        "kw_analyzer": {
          "tokenizer": "kw_tokenizer"
        },
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        },
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "kw_tokenizer": {
          "type": "keyword"
        },
        "edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "fields": {
            "edge": {
              "type": "text",
              "analyzer": "edge_analyzer"
            },
            "ngram": {
              "type": "text",
              "analyzer": "ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
And a query:
POST /tag/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "match": {
                "name.edge": {
                  "query": "HI"
                }
              }
            },
            "boost": "5",
            "boost_mode": "multiply"
          }
        },
        {
          "match": {
            "name.ngram": {
              "query": "HI"
            }
          }
        },
        {
          "match": {
            "name": {
              "query": "HI"
            }
          }
        }
      ]
    }
  }
}
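To see why only the HITS... titles get the edge boost, you can compare the edge analyzer's output for the two kinds of titles. Since edge_tokenizer defines no token_chars, the grams are taken from the start of the whole string, spaces included, so TRENDING HI never produces an HI token while HITS OTHER does:
POST /tag/_analyze
{
  "analyzer": "edge_analyzer",
  "text": ["TRENDING HI", "HITS OTHER"]
}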

Custom analyzer, use case: zip-code [Elasticsearch]

Let there be an index/type named customers/customer.
Each document in this set has a zip code as a property.
Basically, a zip code can look like:
String-String (ex: 8907-1009)
String String (ex: 211-20)
String (ex: 30200)
I'd like to set up my index analyzer to return as many matching documents as possible. Currently, I do it like this:
PUT /customers/
{
  "mappings": {
    "customer": {
      "properties": {
        "zip-code": {
          "type": "string",
          "index": "not_analyzed"
        }
        ... some string properties ...
      }
    }
  }
}
When I search for a document, I use this request:
GET /customers/customer/_search
{
  "query": {
    "prefix": {
      "zip-code": "211-20"
    }
  }
}
That works if you search for the exact stored value. But for instance if the zip code is "200 30", then searching with "200-30" will not give any results.
I'd like to configure my index analyzer so that I don't have this problem.
Can someone help me?
Thanks.
P.S. If you want more information, please let me know ;)
As soon as you want to find variations you don't want to use not_analyzed.
Let's try this with a different mapping:
PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "standard",
          "filter": []
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}
We're using the standard tokenizer; strings will be broken up at whitespaces and punctuation marks (including dashes) into tokens. You can see the actual tokens if you run the following query:
POST zip/_analyze
{
  "analyzer": "zip_code",
  "text": ["8907-1009", "211-20", "30200"]
}
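The response (abbreviated here to just the tokens) should show that the dash acts as a separator:
{
  "tokens": [
    { "token": "8907" },
    { "token": "1009" },
    { "token": "211" },
    { "token": "20" },
    { "token": "30200" }
  ]
}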
Add your examples:
POST zip/_doc
{
  "zip": "8907-1009"
}
POST zip/_doc
{
  "zip": "211-20"
}
POST zip/_doc
{
  "zip": "30200"
}
Now the query seems to work fine:
GET zip/_search
{
  "query": {
    "match": {
      "zip": "211-20"
    }
  }
}
This will also work if you just search for "211". However, this might be too lenient, since it will also find "20", "20-211", "211-10",...
What you probably want is a phrase search where all the tokens in your query need to be in the field and also in the right order:
GET zip/_search
{
  "query": {
    "match_phrase": {
      "zip": "211"
    }
  }
}
Addition:
If the ZIP codes have a hierarchical meaning (if you have "211-20" you want this to be found when searching for "211", but not when searching for "20"), you can use the path_hierarchy tokenizer.
So changing the mapping to this:
PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_code": {
          "tokenizer": "zip_tokenizer",
          "filter": []
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_code"
        }
      }
    }
  }
}
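Running _analyze again makes the difference visible: with the path_hierarchy tokenizer, "8907-1009" produces the prefix-path tokens 8907 and 8907-1009, but never 1009 on its own:
POST zip/_analyze
{
  "analyzer": "zip_code",
  "text": "8907-1009"
}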
Using the same 3 documents from above you can use the match query now:
GET zip/_search
{
  "query": {
    "match": {
      "zip": "1009"
    }
  }
}
"1009" won't find anything, but "8907" or "8907-1009" will.
If you want to also find "1009", but with a lower score, you'll have to analyze the zip code with both variations I have shown (combine the 2 versions of the mapping):
PUT zip
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "zip_hierarchical": {
          "tokenizer": "zip_tokenizer",
          "filter": []
        },
        "zip_standard": {
          "tokenizer": "standard",
          "filter": []
        }
      },
      "tokenizer": {
        "zip_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "zip": {
          "type": "text",
          "analyzer": "zip_standard",
          "fields": {
            "hierarchical": {
              "type": "text",
              "analyzer": "zip_hierarchical"
            }
          }
        }
      }
    }
  }
}
Add a document with the inverse order to properly test it:
POST zip/_doc
{
  "zip": "1009-111"
}
Then search both fields, but boost the one with the hierarchical tokenizer by 3:
GET zip/_search
{
  "query": {
    "multi_match": {
      "query": "1009",
      "fields": [ "zip", "zip.hierarchical^3" ]
    }
  }
}
Then you can see that "1009-111" has a much higher score than "8907-1009".

Wildcard on different tokens in Elasticsearch

I have a document which looks like this
Name
Thomy tyson Olando Magua
Using ngram i was able to acheive the wildcard search so that if i type in omy tyson it can return me the above document pretty much similar to this sql query
select name from table where name like '%omy tyson%'
PUT sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "15"
        }
      }
    }
  },
  "mappings": {
    "typename": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "search": {
              "type": "string",
              "analyzer": "my_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
PUT sample/typename/2
{
  "name": "Thomy tyson Olando Magua"
}
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "name.search": "omy tyson"
          }
        }
      ]
    }
  }
}
Is there a way in Elasticsearch to perform a wildcard search on two different words separated by other words, like:
select name from table where name like '%omy Magua%'
So in this case I would like to perform a partial search on the first and fourth words.
Any feedback would be helpful.
