filter_duplicate_text not working in aggregation query - elasticsearch

I'm trying to replicate the filter_duplicate_text example from https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significanttext-aggregation.html.
These are my settings, mapping and documents:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "descricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"descricao": "crianças que vivem na pobreza", "metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "3" }}
{"descricao": " Melhorar a educação e adaptação, redução de impacto e da mudança do clima", "metaodsid": 2}
{"index":{"_index":"ods","_type":"ods", "_id" : "4" }}
{"descricao": "Integrar medidas da mudança do clima nas políticas", "metaodsid": 2}
And when I run the following query:
GET /ods/_search
{
"query": {
"bool": {
"filter": {
"term": {
"metaodsid": 2
}
}
}
},
"aggs" : {
"my_sample" : {
"sampler" : {
"shard_size" : 10
},
"aggs": {
"keywords" : {
"filter_duplicate_text": true,
"significant_text" : { "field" : "descricao" }
}
}
}
}
}
I get back this error message: "Expected [START_OBJECT] under [filter_duplicate_text], but got a [VALUE_BOOLEAN] in [keywords]". I don't understand what is happening, because if I remove the line "filter_duplicate_text": true, the query works as expected.
Does anyone know how to solve it? Thanks.

Looking at the reference, it looks like you have filter_duplicate_text in the wrong place. It is an option of significant_text, so it should be a sibling of field inside significant_text, not a sibling of significant_text itself:
"aggs": {
"keywords" : {
"significant_text" : {
"field" : "descricao",
"filter_duplicate_text": true
}
}
}
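To double-check the placement before sending the request, the corrected body can be built and inspected programmatically. This is only a sketch in Python using plain dictionaries: the helper name build_keywords_agg is made up for illustration, and no Elasticsearch client is involved.

```python
import json


def build_keywords_agg(field, filter_duplicates=True):
    """Return the sampler + significant_text aggregation from the answer above.

    Hypothetical helper: it only builds a plain dict, it does not call Elasticsearch.
    """
    return {
        "my_sample": {
            "sampler": {"shard_size": 10},
            "aggs": {
                "keywords": {
                    # filter_duplicate_text is an option of significant_text,
                    # so it sits next to "field", not next to "significant_text".
                    "significant_text": {
                        "field": field,
                        "filter_duplicate_text": filter_duplicates,
                    }
                }
            },
        }
    }


body = {
    "query": {"bool": {"filter": {"term": {"metaodsid": 2}}}},
    "aggs": build_keywords_agg("descricao"),
}
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /ods/_search; the only aggregation type under keywords is significant_text, which is what the parser error was complaining about.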

How to find all Nike products with "nikeeeee" keyword

I have an Elasticsearch db with 15 million products.
I want to write a query that finds all Nike products when a user searches for "nikeeeee", "nikeoff", "bestnike", "nikenike", "nike-Nike" or similar keywords.
When I used a fuzzy query, the results returned were not relevant.
How can I handle it?
Thanks in advance
To find all Nike products when a user searches for "nikeeeee":
As mentioned by @rabbitbr, you can use synonyms.
With synonyms, multiple words can map to the same token.
Below is a working example:
PUT <index-name>
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [
"nikeeeee,nikeoff,bestnike,nikenike,nike-nike => nike"
]
}
}
}
}
}
}
POST <index-name>/_doc
{
"title":"nike shoes"
}
GET <index-name>/_search
{
"query": {
"match": {
"title": "bestnike"
}
}
}
Result
"hits" : [
{
"_index" : "index50",
"_type" : "_doc",
"_id" : "ky9VF4QBpvliSuG-OTh-",
"_score" : 0.6931471,
"_source" : {
"title" : "nike shoes"
}
}
]
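To make the effect of the explicit synonym rule concrete, here is a toy Python sketch of what "a,b => c" does to a single token. It is not how Elasticsearch implements the synonym filter, and it ignores tokenization (for example, the standard tokenizer would split "nike-nike" on the hyphen before the filter sees it). The function names are made up for illustration.

```python
def parse_synonym_rule(rule):
    """Parse an explicit rule such as "a,b => c" into {"a": "c", "b": "c"}."""
    left, right = rule.split("=>")
    target = right.strip()
    return {term.strip(): target for term in left.split(",")}


def normalize(token, mapping):
    """Replace a token by its synonym target, or keep it unchanged."""
    return mapping.get(token, token)


mapping = parse_synonym_rule("nikeeeee,nikeoff,bestnike,nikenike,nike-nike => nike")

for query in ["bestnike", "nikeoff", "adidas"]:
    print(query, "->", normalize(query, mapping))
```

Every listed variant collapses to nike, while unlisted words pass through untouched. That also shows the limit of the approach: each misspelling has to be listed explicitly.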

How to Query elasticsearch index with nested and non nested fields

I have an Elasticsearch index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND the test with id 587 has score > 5, and so on. At a high level, what I need looks something like the query below. I don't know what the actual query would be, so apologies for the messy pseudo-query:
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible, so that we can supply test IDs and their respective score filters dynamically, together with filters on non-nested fields such as age, name and status.
Something like this?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You should use an ngram tokenizer here: wildcard queries are expensive, and I wouldn't recommend them for performance reasons.
Change your mapping to the one below, which defines a custom analyzer.
The way Elasticsearch (or rather Lucene) indexes a statement is that it first breaks the statement or paragraph into words, or tokens, and then stores those tokens in the inverted index for that particular field. This process is called analysis, and it applies only to the text datatype.
So you only get a document back if the tokens from your query are present in the inverted index.
By default, the standard analyzer is applied. What I've done is create my own analyzer that uses an ngram tokenizer, which produces many more tokens than just the words.
With the default analyzer, "Life is beautiful" becomes life, is, beautiful.
With ngrams (min_gram 3, max_gram 4), the tokens for Life would be lif, ife and life.
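The token expansion can be reproduced with a few lines of Python. This is a rough approximation of an ngram tokenizer with min_gram 3, max_gram 4 and token_chars of letter and digit, written only to show which substrings end up in the index; the function name ngram_tokens is hypothetical and the sketch does not change the case of the input.

```python
import re


def ngram_tokens(text, min_gram=3, max_gram=4):
    """Approximate an ngram tokenizer with token_chars = [letter, digit].

    The text is first split on anything that is not a letter or digit, then
    every substring of length min_gram..max_gram is emitted for each piece.
    """
    tokens = []
    for piece in re.findall(r"[^\W_]+", text):
        for start in range(len(piece)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(piece):
                    tokens.append(piece[start:start + size])
    return tokens


print(ngram_tokens("life"))           # ['lif', 'life', 'ife']
print(ngram_tokens("Schwarb"))        # contains 'war' and 'warb'
print(ngram_tokens("abc#gmail.com"))  # pieces: abc, gmail, com
```

Because "war" is one of the tokens stored for "Schwarb", a plain match query on "war" can find the document without any wildcard.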
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've added a keyword sub-field to name, email and status, as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would use term queries on the keyword sub-fields, while for normal searches (LIKE in SQL) I would use simple match queries on the text fields, provided they use the ngram tokenizer.
Also note that for >= and <= you need a range query.
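Since the question asks for dynamic test IDs and score thresholds, the request body can be assembled in application code. The following is a minimal Python sketch, assuming one nested clause per (test_id, min_score) pair is what is wanted; build_student_query and its parameters are hypothetical names, and the function only builds a dict without calling Elasticsearch.

```python
import json


def build_student_query(name=None, email=None, test_filters=None):
    """Build a bool query from optional text filters and (test_id, min_score) pairs."""
    must = []
    if name:
        must.append({"match": {"name": name}})
    if email:
        must.append({"match": {"email": email}})
    # One nested clause per test, so each id/score pair has to match
    # within the same nested "tests" object.
    for test_id, min_score in (test_filters or []):
        must.append({
            "nested": {
                "path": "tests",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"tests.test_id": test_id}},
                            {"range": {"tests.test_score": {"gte": min_score}}},
                        ]
                    }
                },
            }
        })
    return {"query": {"bool": {"must": must}}}


body = build_student_query(name="war", email="gmail", test_filters=[(587, 5), (588, 6)])
print(json.dumps(body, indent=2))
```

Keeping each test in its own nested clause matters: putting two different test_id terms inside a single nested bool/must would require one nested object to carry both IDs, which never matches.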
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that the document from your question shows up in the response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

ElasticSearch "more like this" returning empty result

I made a very simple test to figure out my mistake, but did not find it. I created two indexes and I'm trying to search documents in the ppa index that are similar to a given document in the ods index (like the second example here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html).
These are my settings, mappings and documents for the ppa index:
PUT /ppa
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ppa/_mapping/ppa
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"}}}
POST /_bulk
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
These are my settings, mappings and documents for the ods index:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"metaodsdescricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "metaodsdescricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"metaodsdescricao": "crianças que vivem na pobreza", "metaodsid": 2}
Now, this search doesn't work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : [
{
"_index" : "ods",
"_type" : "ods",
"_id" : "1"
}
],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
But this one does work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : ["erradicar a pobreza"],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
What is happening?
Please, help me make this return something other than empty.
The "more like this" query works well when you have indexed a lot of data. An empty result can be a symptom of having very few documents in the Elasticsearch index.
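One more thing that may be worth ruling out (an assumption, not something confirmed in the question): more_like_this takes the text of a liked document from the fields listed in fields, and the ods documents store their text in metaodsdescricao rather than descricao. The Python sketch below only checks, on plain dictionaries, whether a liked document contains the requested fields; missing_like_fields is a hypothetical helper.

```python
def missing_like_fields(mlt_fields, liked_source):
    """Return the requested fields that the liked document does not contain."""
    return [field for field in mlt_fields if field not in liked_source]


# _source of ods document 1, copied from the bulk request in the question.
ods_doc_1 = {"metaodsdescricao": "erradicar a pobreza", "metaodsid": 1}

print(missing_like_fields(["descricao"], ods_doc_1))
print(missing_like_fields(["metaodsdescricao"], ods_doc_1))
```

If the first list is not empty, the liked document may contribute no terms for those fields, which would be consistent with the free-text version of the query returning hits while the document-based one does not.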

Elasticsearch "failed to find analyzer"

I have created a synonym analyser on an index:
curl http://localhost:9200/test_index/_settings?pretty
{
"test_index" : {
"settings" : {
"index" : {
"creation_date" : "1429175067557",
"analyzer" : {
"search_synonyms" : {
"filter" : [ "lowercase", "search_synonym_filter" ],
"tokenizer" : "standard"
}
},
"uuid" : "Zq6Id8xsRWGofJrNCb7M8w",
"number_of_replicas" : "1",
"analysis" : {
"filter" : {
"search_synonym_filter" : {
"type" : "synonym",
"synonyms" : [ "sneakers,pumps" ]
}
}
},
"number_of_shards" : "5",
"version" : {
"created" : "1050099"
}
}
}
}
}
But when I try to use it with the mapping:
curl -XPUT 'http://localhost:9200/test_index/_mapping/product_catalog?pretty' -H "Content-Type: application/json" \
-d '{"product_catalog": {"properties" : {"name": {"type": "string", "include_in_all": true, "analyzer":"search_synonyms"} }}}'
I get the error:
{
"error" : "MapperParsingException[Analyzer [search_synonyms] not found for field [name]]",
"status" : 400
}
I have also tried to just check the analyser with:
curl 'http://localhost:9200/test_index/_analyze?analyzer=search_synonyms&pretty=1&text=pumps'
but still get an error:
ElasticsearchIllegalArgumentException[failed to find analyzer [search_synonyms]]
Any ideas? I may be missing something, but I can't think what.
The analyzer element has to be inside the analysis element. Change your index creation request as follows:
{
"settings": {
"index": {
"creation_date": "1429175067557",
"uuid": "Zq6Id8xsRWGofJrNCb7M8w",
"number_of_replicas": "0",
"analysis": {
"filter": {
"search_synonym_filter": {
"type": "synonym",
"synonyms": [
"sneakers,pumps"
]
}
},
"analyzer": {
"search_synonyms": {
"filter": [
"lowercase",
"search_synonym_filter"
],
"tokenizer": "standard"
}
}
},
"number_of_shards": "5",
"version": {
"created": "1050099"
}
}
}
}
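A quick way to spot this kind of misplacement is to look for analysis components that sit directly under index instead of under index.analysis. The Python sketch below does that on plain dictionaries shaped like the settings in the question; misplaced_analysis_keys and the component list are assumptions made for illustration, not part of any Elasticsearch client.

```python
ANALYSIS_COMPONENTS = ("analyzer", "tokenizer", "filter", "char_filter")


def misplaced_analysis_keys(index_settings):
    """Return analysis components defined directly under "index"."""
    return [key for key in ANALYSIS_COMPONENTS if key in index_settings]


# The "index" section from the question: analyzer sits next to analysis.
broken = {
    "analyzer": {
        "search_synonyms": {
            "filter": ["lowercase", "search_synonym_filter"],
            "tokenizer": "standard",
        }
    },
    "analysis": {
        "filter": {
            "search_synonym_filter": {"type": "synonym", "synonyms": ["sneakers,pumps"]}
        }
    },
}

# The corrected layout: analyzer moved inside analysis.
fixed = {
    "analysis": {
        "filter": broken["analysis"]["filter"],
        "analyzer": broken["analyzer"],
    }
}

print(misplaced_analysis_keys(broken))  # ['analyzer']
print(misplaced_analysis_keys(fixed))   # []
```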

Search results ordered by search-text-length/match length

I have this simple mapping:
PUT testindex
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "edgeNGram"]
}
},
"filter" : {
"ngram" : {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer" : "ngram_analyzer"
}
}
}
}
}
With these values:
PUT testindex/test/1
{"name" : "Power"}
PUT testindex/test/2
{"name" : "Pow"}
PUT testindex/test/3
{"name" : "PowerMax"}
PUT testindex/test/4
{"name" : "PowerRangers"}
And searched this:
GET testindex/test/_search
{
"query": {
"match": {
"name": "Po"
}
}
}
And got:
PowerRangers
Power
Pow
PowerMax
All with the same score of 0.2876821
Clearly, the closest result to "Po" is "Pow", so I expect to receive it first, but I don't.
How should I modify my mapping so that it behaves this way?
I think scripted sorting is the solution, but it comes with a performance cost. The query you can use is this:
GET testindex/test/_search
{
"query": {
"match": {
"name": "Po"
}
},
"sort": {
"_script": {
"script": "_source['name'].value.length",
"type": "number",
"order": "asc"
}
}
}
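As an illustration of the ordering the script sort is meant to produce, here is a Python sketch that sorts the four names from the question by length. It does not call Elasticsearch; it only shows the intended result order.

```python
# Names in the order the original query returned them.
names = ["PowerRangers", "Power", "Pow", "PowerMax"]

# Shortest name first; sorted() is stable, so ties keep their incoming order.
by_length = sorted(names, key=len)
print(by_length)  # ['Pow', 'Power', 'PowerMax', 'PowerRangers']
```

With a shorter name treated as a closer match to the two-letter query, "Pow" comes out on top, which is the order asked for in the question.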
