Is Simple Query String compatible with shingles? - elasticsearch

I am wondering if it is possible to use shingles with the Simple Query String query. My mapping for the relevant field looks like this:
{
"text_2": {
"type": "string",
"analyzer": "shingle_analyzer"
}
}
The analyzer and filters are defined as follows:
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "custom_delimiter", "lowercase", "stop", "snowball", "filter_shingle"]
}
},
"filter": {
"filter_shingle":{
"type":"shingle",
"max_shingle_size":5,
"min_shingle_size":2,
"output_unigrams":"true"
},
"custom_delimiter": {
"type": "word_delimiter",
"preserve_original": true
}
}
I am performing the following search:
{
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"analyzer": "shingle_analyzer",
"fields": [
"text_2"
],
"lenient": "false",
"default_operator": "and",
"query": "porsches small red"
}
}
]
}
}
}
Now, I have a document with text_2 = small red porsches. Since I am using the AND operator, I would expect my document to NOT match, since the above query should produce a shingle of "porsches small red", which is a different order. However, when I look at the match explanation I am only seeing the single word tokens "red" "small" "porsche", which of course match.
Is SQS incompatible with shingles?

The answer is "Yes, but...".
What you're seeing is normal if, as the explanation output suggests, the text_2 field is effectively indexed with the standard analyzer: in that case the only tokens produced and indexed for small red porsches are small, red and porsches.
On the query side, you're probably using a shingle analyzer with output_unigrams set to true (the default), which means that unigram tokens are produced in addition to the shingles (again, judging by the explanation you're seeing). Those unigrams are the only reason you get matches at all. If you want to match on bigrams, one solution is to use the shingle analyzer at indexing time too, so that the bigrams small red and red porsches are indexed in addition to the unigrams small, red and porsches.
Then, at query time, the unigrams would still match, and the small red bigram would match as well. To match only on bigrams, you can define another shingle analyzer just for query time with output_unigrams set to false, so that only shingles are generated from your search input. If your query contains a single word (e.g. porsches), that analyzer emits the lone unigram only when output_unigrams_if_no_shingles is set to true (it defaults to false), in which case the query would still match your document. If that's not desired, leave output_unigrams_if_no_shingles at false in your shingle search analyzer.
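To make the difference between the two query-time settings concrete, here is a minimal pure-Python sketch of shingle generation. It is not Elasticsearch's implementation: the helper `shingles` is hypothetical, and the stemming, stop-word and word-delimiter filters from the question's analyzer are ignored. It assumes the document was indexed with unigrams plus shingles of size 2 to 5.

```python
def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Return word n-grams (shingles) of the given sizes, optionally with unigrams."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out

# Tokens assumed to be in the index for the document "small red porsches".
indexed = set(shingles("small red porsches".split()))

# The same query analyzed with and without unigrams.
query_with_unigrams = shingles("porsches small red".split())
query_shingles_only = shingles("porsches small red".split(), output_unigrams=False)

unigram_hits = [t for t in query_with_unigrams if t in indexed]
shingle_hits = [t for t in query_shingles_only if t in indexed]
all_shingles_match = all(t in indexed for t in query_shingles_only)

print(unigram_hits)
print(shingle_hits)
print(all_shingles_match)
```

With unigrams enabled, porsches, small and red each find a counterpart in the index, which is consistent with the match seen in the question. With shingles only, porsches small and porsches small red have no counterpart, so a query requiring every shingle would not be expected to match.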

Related

Elasticsearch combined_fields query with synonym_graph token filter

I'm trying to use a combined_fields query with a synonym_graph search-time token filter in Elasticsearch. When I query for a multi-term phrase in my synonym file, Elasticsearch seems to unconfigurably switch from "or logic" to "and logic" between my original terms. Here's an example Elasticsearch query that has been exaggerated for demonstration purposes:
GET /products/_search
{
"query": {
"bool": {
"should": [
{
"combined_fields": {
"query": "boxes other rectangle hinged lid hook cutout",
"operator": "or",
"minimum_should_match": 1,
"fields": [
"productTitle^9",
"fullDescription^5"
],
"auto_generate_synonyms_phrase_query": false
}
}
]
}
}
}
When I submit the query on my index with an empty synonyms.txt file, it returns >1000 hits. As expected, the top hits contain all or many of the terms in the query, and the result set is composed of all documents that contain any of the terms. However, when I add this line to the synonyms.txt file:
black spigot, boxes other rectangle hinged lid hook cutout
the query only returns 4 hits. These hits either contain all of the terms in my query across the queried fields, or both the terms "black" and "spigot".
My conclusion is that presence of the phrase in the synonyms file is influencing how the "non-synonym-replaced" phrase is being searched for. This seems counterintuitive - adding a phrase to the synonyms file should only possibly increase the number of results that a search for that exact phrase produces, right?
Does anyone know what I'm doing incorrectly, or if my expectations are reliant upon some fundamental misunderstanding of how Elasticsearch works? I observe the same behavior when I use a multi-match query or an array of match queries, and I've tried every combination of query options that I reasonably think might resolve the problem.
For reference, here is my analyzer configuration:
"analysis": {
"analyzer": {
"indexAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop",
"productSynonym"
]
}
},
"filter": {
"productSynonym": {
"type": "synonym_graph",
"synonyms_path": "analysis/synonyms.txt"
}
}
}

how to remove a stop word from default _english_ stop-words list in elasticsearch?

I am filtering text using the default English stop words. I found that 'and' is an English stop word, but I need to search for results containing 'and'. I just want to remove the word 'and' from the default English stop-word filter and keep using the other stop words as usual. My Elasticsearch settings look similar to the following:
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace" ,
"filter": ["stop_english"]
}
}....,
"filter":{
"stop_english": {
"type": "stop",
"stopwords": "_english_"
}
}
I expect the _search API to return documents containing the word 'and'.
You can set the stop words for a given index manually like this:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["and", "is", "the"]
}
}
}
}
}
I also found the list of English stop words used by Elasticsearch here. If you manually set the same list of stop words minus "and" in a new index and reindex your data into that index, you should be good to go!
Regarding reindexing your data, check out the reindex API. I believe it is required because your data is tokenized at ingestion time, so you need to redo the ingestion by reindexing. Reindexing is needed most of the time when changing analysis settings or mappings (not 100% sure, but I think it makes sense).
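As a sketch of the approach described above, the following Python snippet builds a settings body with an explicit stop-word list. The list is a transcription of what I understand the `_english_` set to contain and should be checked against the list published for your Elasticsearch version. The filter name `english_without_and` and the helper `stop_filter_settings` are made up for this example.

```python
import json

# Assumed contents of the _english_ stop-word set; verify for your version.
ENGLISH_STOP_WORDS = [
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
]

def stop_filter_settings(keep_words):
    """Build index settings whose stop filter no longer removes keep_words."""
    stopwords = [w for w in ENGLISH_STOP_WORDS if w not in keep_words]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "english_without_and": {"type": "stop", "stopwords": stopwords}
                },
                "analyzer": {
                    # Mirrors the question's default analyzer (whitespace + stop).
                    "default": {
                        "tokenizer": "whitespace",
                        "filter": ["english_without_and"],
                    }
                },
            }
        }
    }

body = stop_filter_settings({"and"})
print(json.dumps(body, indent=2))
```

The printed body could be sent when creating the new index, after which the existing documents would be reindexed into it.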

analyser with ngram token depending on term length

I'm building an analyser to provide partial search on terms. I want to use a 2-5 ngram tokenizer at index time and a 5-5 ngram tokenizer at search time.
The rationale for using 2-5 ngrams at index time is that a partial term query of length 2 should match.
At search time, if the search term is shorter than 5 characters, the term can be looked up directly in the inverted index. If it has 5 characters or more, the term is tokenized into 5-grams and matches if all tokens match.
However, in Elasticsearch, a 5-5 ngram tokenizer won't create any token if the query term is shorter than 5 characters.
One solution could be to use a 2-5 tokenizer at search time, same as for indexing, but this would result in searching for all the 2-gram, 3-gram and 4-gram tokens, which is useless (the 5-gram tokens are sufficient).
Here is my current index mapping:
{
"settings" : {
"analysis":{
"analyzer":{
"index_partial":{
"type":"custom",
"tokenizer":"2-5_ngram_token"
},
"search_partial":{
"type":"custom",
"tokenizer": "5-5_ngram_token"
}
},
"tokenizer":{
"2-5_ngram_token": {
"type":"nGram",
"min_gram":"2",
"max_gram":"5"
},
"5-5_ngram_token": {
"type":"nGram",
"min_gram":"5",
"max_gram":"5"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_trans": {
"type": "text",
"fields": {
"partial": {
"type":"text",
"analyzer":"index_partial",
"search_analyzer":"search_partial"
}
}
}
}
}
}
So my question is: how can I create an analyzer that is a no-op if the search query is shorter than 5 characters, and that creates 5-gram tokens if it has 5 characters or more?
----------------------UPDATE WITH WORK AROUND SOLUTION-----------------------
It does not seem possible to create an analyser that is a no-op if len < 5 and applies 5-5 ngrams if len >= 5.
There are two workarounds to perform partial search:
1- As mentioned by @Amit Khandelwal, one solution is to use the maximum ngram size at index time. If your field has 30 chars max, use a tokenizer with 2-30 ngrams and, at search time, search for the exact term without processing it with the ngram analyser (either via a term query or by setting the search analyzer to keyword).
The drawback of this solution is that it could result in a huge inverted index, depending on the max length.
2- The other solution is to create two fields:
- one for short search terms, which can be looked up in the inverted index directly, without being tokenized
- one for longer search terms, which are tokenized
Depending on the length of the search term, the search is performed on one or the other of those two fields.
Below is the mapping I used for solution 2 (the limit I chose between short and long terms is len=5):
PUT name_test
{
"settings" : {
"max_ngram_diff": 3,
"analysis":{
"analyzer":{
"2-4nGrams":{
"type":"custom",
"tokenizer":"2-4_ngram_token",
"filter": ["lowercase"]
},
"5-5nGrams":{
"type":"custom",
"tokenizer": "5-5_ngram_token",
"filter": ["lowercase"]
}
},
"tokenizer":{
"2-4_ngram_token": {
"type":"nGram",
"min_gram":"2",
"max_gram":"4"
},
"5-5_ngram_token": {
"type":"nGram",
"min_gram":"5",
"max_gram":"5"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_trans": {
"type": "text",
"fields": {
"2-4partial": {
"type":"text",
"analyzer":"2-4nGrams",
"search_analyzer":"keyword"
},
"5-5partial": {
"type":"text",
"analyzer":"5-5nGrams"
}
}
}
}
}
}
And here are the two kinds of request to use with this mapping, depending on the search term length:
GET name_test/_search
{
"query": {
"match": {
"name_trans.2-4partial": {
"query": "ema",
"operator": "and",
"fuzziness": 0
}
}
}
}
GET name_test/_search
{
"query": {
"match": {
"name_trans.5-5partial": {
"query": "emanue",
"operator": "and",
"fuzziness": 0
}
}
}
}
Maybe this will help someone someday :)
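The client-side choice between the two fields can be sketched as follows. This is only an illustration in Python: the function name `build_partial_query` is hypothetical, while the field names and the length limit of 5 come from the mapping above. The short-term branch lowercases the input because the index analyzers apply a lowercase filter but the keyword search analyzer does not.

```python
SHORT_FIELD = "name_trans.2-4partial"  # searched with the keyword analyzer
LONG_FIELD = "name_trans.5-5partial"   # searched with the 5-5 ngram analyzer
LENGTH_LIMIT = 5

def build_partial_query(term):
    """Return a match query body targeting the field suited to the term length."""
    if len(term) < LENGTH_LIMIT:
        field, text = SHORT_FIELD, term.lower()
    else:
        field, text = LONG_FIELD, term
    return {
        "query": {
            "match": {
                field: {"query": text, "operator": "and", "fuzziness": 0}
            }
        }
    }

short_body = build_partial_query("Ema")
long_body = build_partial_query("emanue")
print(short_body)
print(long_body)
```

The two resulting bodies correspond to the two requests shown above and would be sent to `name_test/_search`.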
I am not sure whether this is possible in Elasticsearch, but I can suggest a workaround that we also use in our application, although our use case was different.
Create a custom analyzer using a 2-5 ngram tokenizer on the fields you want to use for partial search. This stores the ngram tokens of those fields in the inverted index. For example, for a field containing foobar as a value, it stores fo, foo, foob, fooba, oo, oob, ooba, oobar, ob, oba, obar, ba, bar, ar.
Now, instead of a match query, use a term query on the partial fields; a term query is not analyzed (you can read about the difference between the two here).
In this case, it doesn't matter whether the search term is shorter than 5 characters or not: it will still match the tokens and you will get the results.
Now let's dry run this on a field containing foobar as a value and test it against some search terms.
Case 1: the search term has fewer than 5 chars, like fo, oo, ar, bar, oob, oba or ooba. It still matches, as these tokens are present in the inverted index.
Case 2: the search term has 5 chars, like fooba or oobar. The document is returned as well, as the index contains these tokens. (Terms longer than max_gram are not in the index, so max_gram has to cover the longest term you want to match.)
Let me know if it's clear or if you need additional clarification.
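The dry run above can be reproduced with a few lines of Python. This is a plain character n-gram helper written for illustration (the name `ngrams` is made up), not the Elasticsearch tokenizer itself, and it assumes a single lowercase word as input.

```python
def ngrams(text, min_gram, max_gram):
    """Return all character n-grams of text with lengths min_gram..max_gram."""
    return [
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    ]

tokens = ngrams("foobar", 2, 5)
print(tokens)

# A term query looks the term up as-is, so it matches only if the exact
# string is one of the indexed tokens.
index = set(tokens)
for term in ["fo", "oob", "fooba", "foobar"]:
    print(term, term in index)
```

In this sketch the 6-character term foobar is not among the tokens, because max_gram is 5.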

Query elasticsearch to make all analyzed ngram tokens to match

I indexed some data using an nGram analyzer (which emits only trigrams) to solve the compound words problem, exactly as described in the ES guide.
However, this doesn't work as expected: the corresponding match query returns all documents where at least one nGram token (per word) matched.
Example:
Let's take these two indexed documents with a single field, using that nGram analyzer:
POST /compound_test/doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "elasticsearch is awesome" }
{ "index": { "_id": 2 }}
{ "content": "some search queries don't perform good" }
Now if I run the following query, I get both results:
"match": {
"content": {
"query": "awesome search",
"minimum_should_match": "100%"
}
}
The query that is constructed from this could be expressed like this:
(awe OR wes OR eso OR som OR ome) AND (sea OR ear OR arc OR rch)
That's why the second document matches (it contains "some" and "search"). It would even match a document with words that contain the tokens "som" and "rch".
What I actually want is a query where each analyzed token must match (ideally depending on the minimum_should_match setting), so something like this:
"match": {
"content": {
"query": "awe wes eso som ome sea ear arc rch",
"analyzer": "whitespace",
"minimum_should_match": "100%"
}
}
...without actually building that query by hand or pre-analyzing it on the client side.
All settings and data to reproduce that behavior can be found at https://pastebin.com/97QxfaSb
Is there such a possibility?
While writing the question, I accidentally found the answer:
If the ngram analyzer uses an ngram token filter to generate trigrams (as described in the guide), it works the way described above. (I guess this is because the ngrams produced from one word are then treated as alternatives for that word rather than as independent tokens.)
To achieve the desired behavior, the analyzer must use the ngram tokenizer instead:
"tokenizer": {
"trigram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"trigrams_with_tokenizer": {
"type": "custom",
"tokenizer": "trigram_tokenizer"
}
}
Producing the tokens this way gives the desired result when querying that field.
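The difference between the two behaviours can be illustrated with a small Python simulation of the two boolean interpretations from the question. It does not reproduce how Elasticsearch builds the query internally; the helper names are hypothetical, and splitting words on non-alphanumeric characters is a simplifying assumption.

```python
import re

def trigrams(word):
    """Return the character trigrams of a single word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def doc_trigrams(text):
    """Return the set of trigrams of every word in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {gram for word in words for gram in trigrams(word)}

docs = {
    1: "elasticsearch is awesome",
    2: "some search queries don't perform good",
}
query_words = "awesome search".split()

def matches_any_per_word(grams):
    # (awe OR wes OR ...) AND (sea OR ear OR ...)
    return all(any(g in grams for g in trigrams(w)) for w in query_words)

def matches_all_grams(grams):
    # awe AND wes AND ... AND rch
    return all(g in grams for w in query_words for g in trigrams(w))

loose = [i for i, text in docs.items() if matches_any_per_word(doc_trigrams(text))]
strict = [i for i, text in docs.items() if matches_all_grams(doc_trigrams(text))]
print(loose)
print(strict)
```

Under the first interpretation both documents match, because document 2 contains som and ome (from "some") and every trigram of "search". Under the second, only document 1 matches, since document 2 has no awe, wes or eso.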

Getting results for multi_match cross_fields query in elasticsearch with custom analyzer

I have an Elasticsearch 5.3 server with products.
Each product has a 14 digit product code that has to be searchable by the following rules. The complete code should match as well as a search term with only the last 9 digits, the last 6, the last 5 or the last 4 digits.
To achieve this, I created a custom analyzer which creates the appropriate tokens at index time using the pattern capture token filter. This seems to be working correctly: the _analyze API shows that the correct terms are created.
To fetch the documents from Elasticsearch, I'm using a multi_match cross_fields query inside a bool query to search a number of fields simultaneously.
When I have a query string that has a part that matches a product code and a part that matches any of the other fields no results are returned, but when I search for each part separately the appropriate results are returned. Also when I have multiple parts spanning any of the fields except the product code the correct results are returned.
My mapping and analyzer:
PUT /store
{
"mappings": {
"products":{
"properties":{
"productCode":{
"analyzer": "ProductCode",
"search_analyzer": "standard",
"type": "text"
},
"description": {
"type": "text"
},
"remarks": {
"type": "text"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"ProductCodeNGram": {
"type": "pattern_capture",
"preserve_original": "true",
"patterns": [
"\\d{5}(\\d{9})",
"\\d{8}(\\d{6})",
"\\d{9}(\\d{5})",
"\\d{10}(\\d{4})"
]
}
},
"analyzer": {
"ProductCode": {
"filter": ["ProductCodeNGram"],
"type": "custom",
"preserve_original": "true",
"tokenizer": "standard"
}
}
}
}
}
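The capture patterns in the mapping above can be tried out with Python's `re` module. This is only a rough stand-in for the pattern_capture filter with preserve_original enabled: the function name `capture_tokens` is made up, and the 14-digit code used here is a hypothetical value, not data from the question.

```python
import re

# Same patterns as in the mapping, without the JSON escaping.
PATTERNS = [
    r"\d{5}(\d{9})",
    r"\d{8}(\d{6})",
    r"\d{9}(\d{5})",
    r"\d{10}(\d{4})",
]

def capture_tokens(code):
    """Return the original token followed by every captured group."""
    tokens = [code]  # preserve_original
    for pattern in PATTERNS:
        for match in re.finditer(pattern, code):
            tokens.append(match.group(1))
    return tokens

tokens = capture_tokens("12345678901234")  # hypothetical 14-digit code
print(tokens)
```

For a 14-digit input, the captured groups are the last 9, 6, 5 and 4 digits. The patterns are not anchored, so in this sketch an input with more than 14 digits yields groups that do not end at the last digit.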
The query
GET /store/products/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "[query_string]",
"fields": ["productCode", "description", "remarks"],
"type": "cross_fields",
"operator": "and"
}
}
]
}
}
}
Sample data
POST /store/products
{
"productCode": "999999123456789",
"description": "Foo bar",
"remarks": "Foobar"
}
The following query strings all return one result:
"456789", "foo", "foobar", "foo foobar".
But the query_string "foo 456789" returns no results.
I am very curious as to why the last search does not return any results. I am convinced that it should.
The problem is that you are doing a cross_fields over fields with different analysers. Cross fields only works for fields using the same analyser. It in fact groups the fields by analyser before doing the cross fields. You can find more information in this documentation.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#_literal_cross_field_literal_and_analysis
Although cross_fields needs the same analyzer across the fields it operates on, I've had luck using the tie_breaker parameter to allow other fields (that use different analyzers) to be weighed for the total score.
This has the added benefit of allowing per-field boosting to be calculated in the final score, too.
Here's an example using your query:
GET /store/products/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "[query_string]",
"fields": ["productCode", "description", "remarks"],
"type": "cross_fields",
"tie_breaker": 1 # You may need to tweak this
}
}
]
}
}
}
I also removed the operator field, as I believe using the "AND" operator will cause fields that don't have the same analyzer to be scored inappropriately.
