ElasticSearch "more like this" returning empty result - elasticsearch

I made a very simple test to figure out my mistake, but did not find it. I created two indexes and I'm trying to search documents in the ppa index that are similar to a given document in the ods index (like the second example here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html).
These are my settings, mappings and documents for the ppa index:
PUT /ppa
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ppa/_mapping/ppa
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"}}}
POST /_bulk
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
These are my settings, mappings and documents for the ods index:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"metaodsdescricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "metaodsdescricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"metaodsdescricao": "crianças que vivem na pobreza", "metaodsid": 2}
Now, this search doesn't work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : [
{
"_index" : "ods",
"_type" : "ods",
"_id" : "1"
}
],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
But this one does work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : ["erradicar a pobreza"],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
What is happening?
Please, help me make this return something other than empty.

The "more like this" query work well when you have indexed a lot of data. The empty result can be symptom of very few documents present in the elastic index.

Related

How to find all Nike products with "nikeeeee" keyword

I have an Elasticsearch db with 15 million products.
I want to write a query to find all Nike products when my user search "nikeeeee" or "nikeoff" or "bestnike" or "nikenike" or "nike-Nike" or some keywords like these.
When I used Fazzy query, The result returned was not relevant.
How can i handle it?
Thanks in advance
To find all Nike products when my user search "nikeeeee"
As mentioned by #rabbitbr you can use synonymns
Using synonyms multiple words can point to same token.
Below is working example of it
PUT <index-name>
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [
"nikeeeee,nikeoff,bestnike,nikenike,nike-nike => nike"
]
}
}
}
}
}
}
POST <index-name>/_doc
{
"title":"adidas shoes"
}
GET <index-name>/_search
{
"query": {
"match": {
"title": "bestnike"
}
}
}
Result
"hits" : [
{
"_index" : "index50",
"_type" : "_doc",
"_id" : "ky9VF4QBpvliSuG-OTh-",
"_score" : 0.6931471,
"_source" : {
"title" : "nike shoes"
}
}
]

ElasticSearch: Highlighting with Stemming

I have read this question and attempted to understand the documentation here, but this is complicated.
The problem (I think):
[update 1]
I am using Scala for my code and interface with ES High Level Java API.
I have a stemming analyzer configured. If I search for responsibilities i get results for responsibilities and responsibility. That's great.
BUT
Only the documents with the term responsibilities return highlights.
This is because the search is on the stemmed content , i.e., responsib. However, the highlight is against the unstemmed content. Hence, it finds responsibilities which was a search criteria, but not responsibility, which wasn't.
If I set the highlighter to highlight on the stemmed content, it returns nothing at all. I guess because it is comparing resonsib with responsibilities
Search
I an using the Java high level API. The problem is not the code itself.
Currently, I am highlighting only the content field, returning only responsibilities. Highlighting content.english seems to return nothing
private def buildHighlighter(): HighlightBuilder = {
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder
val highlightBuilder = new HighlightBuilder
val highlightContent = new HighlightBuilder.Field("content")
highlightContent.highlighterType("unified")
highlightBuilder.field(highlightContent)
highlightBuilder
}
Mapping (adumbrated)
{
"settings": {
"number_of_shards": 3,
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": []
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
},
"content": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
[update 2]
Scala code to implement search:
def searchByField(indices: Seq[ESIndexName], terms: Seq[(String, String)], size: Int = 20): SearchResponse = {
val searchRequest = new SearchRequest
searchRequest.indices(indices.map(idx => idx.completeIndexName()): _*)
searchRequest.source(buildTargetFieldsMatchQuery(terms, size))
searchRequest.indicesOptions(IndicesOptions.strictSingleIndexNoExpandForbidClosed())
client.search(searchRequest, RequestOptions.DEFAULT)
}
and query is built as follows:
private def buildTargetFieldsMatchQuery(termsByField: Seq[(String, String)], size: Int): SearchSourceBuilder = {
val query = new BoolQueryBuilder
termsByField.foreach {
case (field, term) =>
if (field == "content") {
logger.debug(field + " should have " + term)
query.should(new MatchQueryBuilder(field+standardAnalyzer, term.toLowerCase))
query.should(new MatchQueryBuilder(field, term.toLowerCase))
}
else if (field == "title"){
logger.debug(field + " should have " + term)
query.should(new MatchQueryBuilder(field+standardAnalyzer, term.toLowerCase())).boost
}
else {
logger.debug(field + " should have " + term)
query.should(new MatchQueryBuilder(field, term.toLowerCase))
}
}
val sourceBuilder: SearchSourceBuilder = new SearchSourceBuilder()
sourceBuilder.query(query)
sourceBuilder.from(0)
sourceBuilder.size(size)
sourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS))
sourceBuilder.highlighter(buildHighlighter())
}
With plain REST the following is working fine for me:
PUT test
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": []
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
}
POST test/_doc/
{
"content": "This is my responsibility"
}
POST test/_doc/
{
"content": "These are my responsibilities"
}
GET test/_search
{
"query": {
"match": {
"content.english": "responsibilities"
}
},
"highlight": {
"fields": {
"content.english": {
"type": "unified"
}
}
}
}
The result is then:
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5D5PPGoBqgTTLzdtM-_Y",
"_score" : 0.18232156,
"_source" : {
"content" : "This is my responsibility"
},
"highlight" : {
"content.english" : [
"This is my <em>responsibility</em>"
]
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5T5PPGoBqgTTLzdtZe8U",
"_score" : 0.18232156,
"_source" : {
"content" : "These are my responsibilities"
},
"highlight" : {
"content.english" : [
"These are my <em>responsibilities</em>"
]
}
}
]
Looking at your Java / Groovy (?) code it looks close enough to the example in the docs. Could you log the actual query you are running, so we can figure out what is going wrong? Generally it should work like this.

filter_duplicate_text not working aggregation query

I'm trying to replicate the filter_duplicate_text example from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significanttext-aggregation.html.
These are my settings, mapping and documents:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "descricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"descricao": "crianças que vivem na pobreza", "metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "3" }}
{"descricao": " Melhorar a educação e adaptação, redução de impacto e da mudança do clima", "metaodsid": 2}
{"index":{"_index":"ods","_type":"ods", "_id" : "4" }}
{"descricao": "Integrar medidas da mudança do clima nas políticas", "metaodsid": 2}
And when I run the following query:
GET /ods/_search
{
"query": {
"bool": {
"filter": {
"term": {
"metaodsid": 2
}
}
}
},
"aggs" : {
"my_sample" : {
"sampler" : {
"shard_size" : 10
},
"aggs": {
"keywords" : {
"filter_duplicate_text": true,
"significant_text" : { "field" : "descricao" }
}
}
}
}
}
I receive back this error message: "Expected [START_OBJECT] under [filter_duplicate_text], but got a [VALUE_BOOLEAN] in [keywords]". I did not realize what is happening because if I remove the line "filter_duplicate_text": true, then the query works as expected.
Does anyone knows how to solve it? Thanks.
Looking at the reference, looks like you got the filter_duplicate_text at the wrong place. It should be a sibling of field not significant_text so like:
"aggs": {
"keywords" : {
"significant_text" : {
"field" : "descricao",
"filter_duplicate_text": true
}
}
}

Elasticsearch phrase suggester prefix phonetic differences

I was wondering if there is any way for the phrase suggester to correct prefix spelling mistakes on phonetic differences.
Elasticsearch 5.1.2
Testing in Kibana 5.1.2
For Example:
Instead of "circus" someone wrote "sircus", or instead of "coding" someone wrote "koding".
Funny thing is, that instead of "phrase" you can write "frase" and get a suggestion.
Here is my setup.
Settings:
PUT text_index
{
"settings": {
"analysis": {
"analyzer": {
"suggests_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"shingle_filter"
],
"type": "custom"
},
"reverse": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "reverse"]
}
},
"filter": {
"shingle_filter": {
"min_shingle_size": 2,
"max_shingle_size": 5,
"type": "shingle"
}
}
}
},
"mappings": {
"testtype": {
"properties": {
"suggest_field": {
"type": "text",
"analyzer": "suggests_analyzer",
"fields": {
"reverse": {
"type": "text",
"analyzer": "reverse"
}
}
}
}
}
}
}
Some documents:
POST test_index/test_type/_bulk
{"index":{}}
{ "suggest_field": "phrase"}
{"index":{}}
{ "suggest_field": "Circus"}
{"index":{}}
{ "suggest_field": "Coding"}
Querying:
POST /so-index/_search
{
"suggest" : {
"text" : "sircus",
"simple_phrase" : {
"phrase" : {
"field" : "suggest_field",
"max_errors": 0.9,
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
},
"direct_generator" : [ {
"field" : "suggest_field",
"suggest_mode" : "always"
}, {
"field" : "suggest_field.reverse",
"suggest_mode" : "always",
"pre_filter" : "reverse",
"post_filter" : "reverse"
}]
}
}
}
}
Also, I repeat following steps a few times (between 5 and 10) without changing anything:
delete index
put index, settings & mappings
add documents
query (codign)
Sometimes I get suggestions and sometimes I don't. Is there any explanation for it?
Try setting "prefix_length": 0 in the direct_generator.

Search results ordered by search-text-length/match length

I have this simple mapping:
PUT testindex
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "edgeNGram"]
}
},
"filter" : {
"ngram" : {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer" : "ngram_analyzer"
}
}
}
}
}
With these values:
PUT testindex/test/1
{"name" : "Power"}
PUT testindex/test/2
{"name" : "Pow"}
PUT testindex/test/3
{"name" : "PowerMax"}
PUT testindex/test/4
{"name" : "PowerRangers"}
And searched this:
GET testindex/test/_search
{
"query": {
"match": {
"name": "Po"
}
}
}
And got:
PowerRangers
Power
Pow
PowerMax
All with the same score of 0.2876821
Clearly, the closest result to "Po" is "Pow", and that I expect to receive first; but I don't.
How Should I modify my mapping to behave by this logic?
I think scripted sorting is the solution, but it comes with a performance decrease drawback. See here more about this. And the query you can use is this:
GET testindex/test/_search
{
"query": {
"match": {
"name": "Po"
}
},
"sort": {
"_script": {
"script": "_source['name'].value.length",
"type": "number",
"order": "asc"
}
}
}

Resources