In the index I'm building, I'm interested in running a query, then (using facets) returning the shingles of that query. Here's the analyzer I'm using on the text:
{
"settings": {
"analysis": {
"analyzer": {
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer"
]
}
},
"filter": {
"custom_stemmer" : {
"type": "stemmer",
"name": "english"
},
"custom_stop": {
"type": "stop",
"stopwords": "_english_"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "3"
}
}
}
}
}
The major issue is that, with Lucene 4.4, stop filters no longer support the enable_position_increments parameter to eliminate shingles that contain stop words. Instead, I'd get results like..
"red and yellow"
"terms": [
{
"term": "red",
"count": 43
},
{
"term": "red _",
"count": 43
},
{
"term": "red _ yellow",
"count": 43
},
{
"term": "_ yellow",
"count": 42
},
{
"term": "yellow",
"count": 42
}
]
Naturally this GREATLY skews the number of shingles returned. Is there a way post-Lucene 4.4 to manage this without doing post-processing on the results?
Probably not the most optimal solution, but the most blunt would be to add another filter to your analyzer to kill "_" filler tokens. In the example below I called it "kill_fillers":
"shingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle",
"custom_stemmer",
"kill_fillers"
],
...
Add "kill_fillers" filter to your list of filters:
"filters":{
...
"kill_fillers": {
"type": "pattern_replace",
"pattern": ".*_.*",
"replace": "",
},
...
}
im not sure if this helps, but in elastic definition of shingles, you can use the parameter filler_token which is by default _. set it to, for example, an empty string:
$indexParams['body']['settings']['analysis']['filter']['shingle-filter']['filler_token'] = "";
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/analysis-shingle-tokenfilter.html
Related
I have index with this settings
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter",
"asciifolding",
"elision",
"standard"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "32"
}
}
}
and have mapping for the name field
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
now I have several examples of names in documents. Name is one field with first name and last name inside.
--макс---
-макс -
{something} макс
макс {something}
I am using this query to find the documents with that name with alphabetical sorting
{
"query": {
"match": {
"name": {
"query": "макс",
"operator" : "and"
}
}
},
"sort": [
{"name.keyword" : "asc"}
]
}
it is bringing results as I wrote. but I expect that макс {something} will come for the first position than others because it is starting with a query which I wrote.
Can somebody help be there
So the query is by default scoring documents based on "how well they matched", this score is used to rank the "best matches first". But as soon as you define an sort you are saying ignore the query score and only using this field to rank the results. Now the results are still restricted to only documents matching the query but the idea of best match is lost unless you keep the special value _score in your sort statement somewhere.
Like this:
"sort": [
{
"productLine.keyword": {
"order": "desc"
}
},
{
"_score": {
"order": "desc"
}
}
]
Maybe you can just remove the sort and get the results you want based on default score sorting. Include a few example documents to make this fully reproducible if you want more support from the SO community
Let's assume these basic documents as an example:
{
"name": "pants",
"description": "with stripes",
"items": [
{
"color": "red",
"size": "44"
},
{
"color": "blue",
"size": "38"
}
]
}
{
"name": "shirt",
"description": "with stripes",
"items": [
{
"color": "green",
"size": "40"
}
]
}
{
"name": "pants",
"description": "with dots",
"items": [
{
"color": "green",
"size": "38"
},
{
"color": "blue",
"size": "38"
}
]
}
I need to find the first document with a search term like pants stripes blue 38. All terms should be connected with AND as I'm not interested in pants with dots or other size and color combinations.
My mapping looks like this:
{
"settings": {
"index.queries.cache.enabled": true,
"index.number_of_shards": 1,
"index.number_of_replicas": 2,
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
},
"synonym": {
"type": "synonym_graph",
"synonyms_path": "dictionaries/de/synonyms.txt",
"updateable" : true
}
},
"analyzer": {
"index_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_normalization",
"german_stemmer"
]
},
"search_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"synonym",
"german_stop",
"german_normalization",
"german_stemmer"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "index_analyzer",
"search_analyzer": "search_analyzer"
},
"description": {
"type": "text",
"analyzer": "index_analyzer",
"search_analyzer": "search_analyzer"
},
"items": {
"type": "nested",
"properties": {
"color": {
"type": "text",
"analyzer": "index_analyzer",
"search_analyzer": "search_analyzer"
},
"size": {
"type": "text",
"analyzer": "index_analyzer",
"search_analyzer": "search_analyzer"
}
}
}
}
}
}
Please ignore the fact that I'm using german stop words and such. I kept the example files above in english so that everyone can understand it but didn't adjust the mapping as the original example is in german.
So ideally what I want my query to look like is this:
{
"query": {
"nested": {
"path": "items",
"query": {
"multi_match": {
"query": "pants stripes blue 38",
"fields": [
"name",
"description",
"items.color",
"items.size"
],
"type": "cross_fields",
"operator": "and",
"auto_generate_synonyms_phrase_query": "false",
"fuzzy_transpositions": "false"
}
}
}
}
}
And the Search Profiler from Kibana show that the query will be executed like this:
ToParentBlockJoinQuery (
+(
+(items.color:pant | items.size:pant | name:pant | description:pant)
+(items.color:strip | items.size:strip | name:strip | description:strip)
+(items.color:blu | items.size:blu | name:blu | description:blu)
+(items.color:38 | items.size:38 | name:38 | description:38)
) #_type:__items)
Which looks to be exactly what I need in terms of AND and OR logic. Search through every attribute with every term and connect those results with AND. So every search term needs to be in one of the fields but it doesn't matter in which it was found.
But this query seems to only search inside the nested documents. In fact it seems like each query can only search through nested objects or the root document. Not both at the same time. If I remove the nested part the Search Profiler shows the difference:
{
"query": {
"multi_match": {
"query": "pants stripes blue 38",
"fields": [
"name",
"description",
"items.color",
"items.size"
],
"type": "cross_fields",
"operator": "and",
"auto_generate_synonyms_phrase_query": "false",
"fuzzy_transpositions": "false"
}
}
}
Results in:
+(
+(items.color:pant | items.size:pant | name:pant | description:pant)
+(items.color:strip | items.size:strip | name:strip | description:strip)
+(items.color:blu | items.size:blu | name:blu | description:blu)
+(items.color:38 | items.size:38 | name:38 | description:38)
) #DocValuesFieldExistsQuery [field=_primary_term]
Both queries return zero results.
So my question is if there is a way to make the above query work and to be able to truly search across all defined fields (nested and root doc) within a multi match query on a term by term basis.
I would like to avoid doing any preprocessing to the search terms in order to split them up based on them being in a nested or root document as that has it's own set of challenges. But I do know that this is a solution to my problem.
Edit
The original files have a lot more attributes. The root document might have up to 250 fields and each nested document might add another 20-30 fields to it. Because the search terms need to search through lot of the fields (possibly not all) any sort of concatenation of nested and root document attributes to make them "searchable" seems unpractical.
A flattened index might be a practical solution. By that I mean copying all root documents fields to the nested document and only indexing the nested docs. But in this question I would like to know if it also works with nested objects without modifying the original structure.
Your intuition regarding flattening is correct but you don't need to copy the root attributes onto the nested fields. You could do the opposite -- through the include_in_root mapping parameter.
When you update your mapping like so:
PUT inventory
{
"settings": {
...
}
},
"mappings": {
"properties": {
...
"items": {
"type": "nested",
"include_in_root": true, <---
"properties": {
...
}
}
}
}
}
and then index some sample docs (at least one of which includes pants because your original question contained none):
POST inventory/_doc
{"name":"shirt","description":"with stripes","items":[{"color":"red","size":"44"},{"color":"blue","size":"38"}]}
POST inventory/_doc
{"name":"shirt","description":"with stripes","items":[{"color":"green","size":"40"}]}
POST inventory/_doc
{"name":"shirt","description":"with dots","items":[{"color":"green","size":"38"},{"color":"blue","size":"38"}]}
// this one *should* match
POST inventory/_doc
{"name":"pants","description":"with stripes","items":[{"color":"red","size":"44"},{"color":"blue","size":"39"}]}
POST inventory/_doc
{"name":"pants","description":"with stripes","items":[{"color":"red","size":"44"},{"color":"blue","size":"38"}]}
You can then use your 2nd query and keep the nested field paths as they are because they're now available in the root, though somewhat confusingly under the same dot-paths:
POST inventory/_search
{
"query": {
"multi_match": {
"query": "pants stripes blue 38",
"fields": [
"name",
"description",
"items.color",
"items.size"
],
"type": "cross_fields",
"operator": "AND",
"auto_generate_synonyms_phrase_query": "false",
"fuzzy_transpositions": "false"
}
}
}
and only the one fully matching document will be returned:
{
"name":"pants",
"description":"with stripes",
"items":[
{
"color":"red",
"size":"44"
},
{
"color":"blue",
"size":"38"
}
]
}
Suppose there is the following mapping with Edge NGram Tokenizer:
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete_analyzer": {
"tokenizer": "autocomplete_tokenizer",
"filter": [
"standard"
]
},
"autocomplete_search": {
"tokenizer": "whitespace"
}
},
"tokenizer": {
"autocomplete_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter",
"symbol"
]
}
}
}
},
"mappings": {
"tag": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "autocomplete_search"
}
}
}
}
}
And the following documents are indexed:
POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}
Then searching
{
"query": {
"match": {
"name": {
"query": "HI"
}
}
}
}
yields all with the same score, or TRENDING - HI with a score higher than one of the others.
How can it be configured, to show with a higher score the entries that actually start with the searcher n-gram? In this case, HITS FIND SOME and HITS OTHER to have a higher score than TRENDING HI; at the same time TRENDING HI should be in the results.
Highlighter is also used, so the given solution shouldn't mess it up.
The highlighter used in query is:
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {}
}
}
Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.
You must understand how elasticsearch/lucene analyzes your data and calculate the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html this will show you what elasticsearch will store, in your case:
T / TR / TRE /.... TRENDING / / H / HI
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build complex query where you need a particular use case. Use must to filter document, then should to score. A common use case is to use different analyzers on a same field (by using the keyword fields in the mapping, you can analyze a same field differently).
3. dont mess highlight
According the doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
You can add an extra query:
{
"query": {
"bool": {
"must" : [
{
"match": {
"name": "HI"
}
}
],
"should": [
{
"prefix": {
"name": "HI"
}
}
]
}
},
"highlight": {
"pre_tags": [
"<"
],
"post_tags": [
">"
],
"fields": {
"name": {
"highlight_query": {
"match": {
"name": "HI"
}
}
}
}
}
}
In this particular case you could add a match_phrase_prefix term to your query, which does prefix match on the last term in the text:
{
"query": {
"bool": {
"should": [
{
"match": {
"name": "HI"
}
},
{
"match_phrase_prefix": {
"name": "HI"
}
}
]
}
}
}
The match term will match on all three results, but the match_phrase_prefix won't match on TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
A possible solution for this problem is to use multifields. They allow for indexing of the same data from your source document in different ways. In your case you could index the name field as default text, then as ngrams and also as edgengrams. Then the query would have to be a bool query comparing with all those different fields.
The final score of documents is composed of the match value for each one. Those matches are also called signals, signalling that there is a match between the query and the document. The document with most signals matching gets the highest score.
In your case all documents would match the ngram HI. But only the HITS FIND SOME and the HITS OTHER document would get the edgengram additional score. This would give those two documents a boost and put them on top. The complication with this is that you have to make sure that the edgengram doesn't split on whitespaces, because then the HI at the end would get the same score as in the beginning of the document.
Here is an example mapping and query for your case:
PUT /tag/
{
"settings": {
"analysis": {
"analyzer": {
"edge_analyzer": {
"tokenizer": "edge_tokenizer"
},
"kw_analyzer": {
"tokenizer": "kw_tokenizer"
},
"ngram_analyzer": {
"tokenizer": "ngram_tokenizer"
},
"autocomplete_analyzer": {
"tokenizer": "autocomplete_tokenizer",
"filter": [
"standard"
]
},
"autocomplete_search": {
"tokenizer": "whitespace"
}
},
"tokenizer": {
"kw_tokenizer": {
"type": "keyword"
},
"edge_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
},
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
},
"autocomplete_tokenizer": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter",
"symbol"
]
}
}
}
},
"mappings": {
"tag": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"edge": {
"type": "text",
"analyzer": "edge_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
And a query:
POST /tag/_search
{
"query": {
"bool": {
"should": [
{
"function_score": {
"query": {
"match": {
"name.edge": {
"query": "HI"
}
}
},
"boost": "5",
"boost_mode": "multiply"
}
},
{
"match": {
"name.ngram": {
"query": "HI"
}
}
},
{
"match": {
"name": {
"query": "HI"
}
}
}
]
}
}
}
I'm new with elasticsearch and I'm trying to develop a search for an ecommerce to suggested 5~10 matching products to the user.
As it should work while the user is typing, we found in the official documentation the use of edge_ngram and it KIND OF worked. But as we searched to test, the results were not the expected. As shows the example below (in our test)
Searching example
As it is shown in the image, the result for the term "Furadeira" (Power Drill) returns accessories before the power drill itself. How can I enhance the results? Even the order where the match is found in the string would help me, I guess.
So, this is the code I have until now:
//PUT example
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
},
"portuguese_stop": {
"type": "stop",
"stopwords": "_portuguese_"
},
"portuguese_stemmer": {
"type": "stemmer",
"language": "light_portuguese"
}
},
"analyzer": {
"portuguese": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"portuguese_stop",
"portuguese_stemmer"
]
},
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
}
}
/* mapping */
//PUT /example/products/_mapping
{
"products": {
"properties": {
"name": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
/* Search */
//GET /example/products/_search
{
"query" : {
"query_string": {
"query" : "furadeira",
"type" : "most_fields", // Tried without this aswell
"fields" : [
"name^8",
"model^10",
"manufacturer^4",
"description"
]
}
}
}
/* Product example */
// PUT example/products/38313
{
"name": "FITA VEDA FRESTA (ESPUMA 4503) 12X5 M [ H0000164055 ]",
"description": "Caracteristicas do produto:Ve…Diminui ruidos indesejaveis.",
"price":21.90,
"product_id": 38313,
"image": "http://placehold.it/200x200",
"quantity": 92,
"width": 20.200,
"height": 1.500,
"length": 21.500,
"weight": 0.082,
"model": "167083",
"manufacturer": "3M DO BRASIL"
}
Thanks in advance.
you could enhance your query to be a so-called boolean query, which contains your existing query in a must clause, but have an additional query in a should clause, that matches exactly (not using the ngrammed field). If the query matches the should clause it will be scored higher.
See the bool query documentation.
let's assume you have a field that differentiates the Main product from Accessories. I call it level_field.
now you can have two approaches to go:
1) boost up The Main product _score by adding 'should' operation:
put your main query in the must operation and in should operation use level_field to boost the _score of documents which are the Main products.
{
"query": {
"bool": {
"must": {
"match": {
"name": {
"query": "furadeira"
}
}
},
"should": [
{ "match": {
"level_field": {
"query": "level1",
"boost": 3
}
}},
{ "match": {
"level_field": {
"query": "level2",
"boost": 2
}
}}
]
}
}
}
2) in second approach you can decrease _score for documents that they are not the Main products by using boosting query:
{
"query": {
"boosting": {
"positive": {
"query_string": {
"query" : "furadeira",
"type" : "most_fields",
"fields" : [
"name^8",
"model^10",
"manufacturer^4",
"description"
]
}
}
},
"negative": {
"term": {
"level_field": {
"value": "level2"
}
}
},
"negative_boost": 0.2
}
}
}
I hope it helps
I am trying to index strings that contain hyphens but do not contain spaces, periods or any other punctuation. I do not want to split up the words based on hyphens, instead I would like to have the hyphens be part of the indexed text.
For example, my 6 text strings would be:
magazineplayon
magazineofhorses
online-magazine
best-magazine
friend-of-magazines
magazineplaygames
I would like to be able to search these string for the text containing "play" or for the text starting with "magazine".
I have been able to use ngram to make the text containing "play" work properly. However, the hyphen is causing text to split and it is including results where "magazine" is in the word after a hyphen. I only want words starting at the beginning of the string with "magazine" to appear.
Based on the sample above, only these 3 should appear when beginning with "magazine":
magazineplayon
magazineofhorses
magazineplaygames
Please help with my ElasticSearch Index Sample:
DELETE /sample
PUT /sample
{
"settings": {
"index.number_of_shards":5,
"index.number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
},
"word_delimiter_filter": {
"type": "word_delimiter",
"preserve_original": true,
"catenate_all" : true
}
},
"analyzer": {
"ngram_index_analyzer": {
"type" : "custom",
"tokenizer": "lowercase",
"filter" : ["nGram_filter", "word_delimiter_filter"]
}
}
}
}
}
PUT /sample/1/_create
{
"name" : "magazineplayon"
}
PUT /sample/3/_create
{
"name" : "magazineofhorses"
}
PUT /sample/4/_create
{
"name" : "online-magazine"
}
PUT /sample/5/_create
{
"name" : "best-magazine"
}
PUT /sample/6/_create
{
"name" : "friend-of-magazines"
}
PUT /sample/7/_create
{
"name" : "magazineplaygames"
}
GET /sample/_search
{
"query": {
"wildcard": {
"name": "*play*"
}
}
}
GET /sample/_search
{
"query": {
"wildcard": {
"name": "magazine*"
}
}
}
Update 1
I updated all my create statements to use TEST after sample:
PUT /sample/test/7/_create
{
"name" : "magazinefairplay"
}
I then ran the following command to return only names that had the word "play" in them instead of doing the wildcard search. This worked correctly and returned only two records.
POST /sample/test/_search
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{"match": { "name.substrings": "play" }}
]
}
}
}
I ran the following command to return only names that started with "magazine". My expectation was that "online-magazine", "best-magazine" and "friend-of-magazines" would not appear. However, all seven records were returned including these three.
POST /sample/test/_search
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{"match": { "name.prefixes": "magazine" }}
]
}
}
}
Is there a way to filter out the prefix where the hyphen is used?
You're on the right path, however, you need to also add another analyzer that leverages the edge-ngram token filter in order to make the "starts with" contraint work. You can keep the ngram for checking fields that "contain" a given word, but you need edge-ngram to check that a field "starts with" some token.
PUT /sample
{
"settings": {
"index.number_of_shards": 5,
"index.number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
},
"edgenGram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"ngram_index_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"nGram_filter"
]
},
"edge_ngram_index_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"edgenGram_filter"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"fields": {
"prefixes": {
"type": "string",
"analyzer": "edge_ngram_index_analyzer",
"search_analyzer": "standard"
},
"substrings": {
"type": "string",
"analyzer": "ngram_index_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
Then your query will become (i.e. search for all documents whose name field contains play or starts with magazine)
POST /sample/test/_search
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{"match": { "name.substrings": "play" }},
{"match": { "name.prefixes": "magazine" }}
]
}
}
}
Note: don't use wildcard for searching for substrings, as it will kill the performance of your cluster (more info here and here)