Is it possible to use a more-like-this query on nested fields? - elasticsearch

I have an "event" type that contains a (nested) press article with a title and a text, both of which have multi-fields.
I've tried:
{
"query":{
"nested":{
"path":"article",
"query":{
"mlt":{
"fields":["article.title.search","article.text.search"],
"max_query_terms": 20,
"min_term_freq": 1,
"include": "false",
"like":[{
"_index":"myindex",
"_type":"event",
"doc":{
"article":{
"title":"this is the title",
"text":"this is the body of the article"
}
}
}]
}
}
}
}
}
But it always returns 0 hits.

{
"query": {
"nested":{
"path":"articles",
"query":{
"more_like_this" : {
"fields" : ["articles.brand", "articles.category", "articles.material"],
"like" : [
{
"_index" : "$index",
"_type" : "$type",
"_id" : "$id"
}
],
"min_term_freq" : 1,
"max_query_terms" : 20
}
}
}
}
}
This works for me. Note that in my mapping the nested fields used in the query store term vectors:
"brand": {
"type": "string",
"index": "not_analyzed",
"term_vector": "yes"
}
Refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
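As a minimal sketch, the same request body can be assembled programmatically so that the nested path is always prefixed to each field name. The helper name `nested_mlt_query`, its parameters, and the placeholder values `myindex` and `1` are assumptions made for illustration, not part of any Elasticsearch client. `_type` is left out of the `like` clause here because mapping types were deprecated in 7.x and removed in 8.x.

```python
import json


def nested_mlt_query(path, fields, index, doc_id, min_term_freq=1, max_query_terms=20):
    """Build a nested more_like_this request body (hypothetical helper)."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "more_like_this": {
                        # Field names inside a nested query carry the path prefix.
                        "fields": [f"{path}.{name}" for name in fields],
                        # Reference an already indexed document by index and id.
                        "like": [{"_index": index, "_id": doc_id}],
                        "min_term_freq": min_term_freq,
                        "max_query_terms": max_query_terms,
                    }
                },
            }
        }
    }


# "myindex" and "1" are placeholder values.
body = nested_mlt_query("articles", ["brand", "category", "material"], "myindex", "1")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request.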

How to find similar tags from text using elastic search

I am trying to use Elasticsearch to find the most similar tags for a piece of text.
For example, I create test_index and insert two documents:
POST test_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST test_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
So, I expect to find the "software" tag (its text or id) from the text "I'm using some softwares and applications".
I was hoping someone could provide an example of how to do this, or at least point me in the right direction.
Thanks.
What you are looking for is a concept called Stemming. You would need to create a Custom Analyzer that uses the Stemmer Token Filter.
Please find the below mapping, sample documents, query and response:
Mapping:
PUT my_stem_index
{
"settings": {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"tags":{
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"keyword":{
"type": "keyword"
}
}
}
}
}
}
From the comments, it appears that you are using a version earlier than 7. In that case you may have to add the type to the mapping:
PUT my_stem_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_stemmer"
]
}
},
"filter":{
"my_stemmer":{
"type":"stemmer",
"name":"english"
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"id":{
"type":"keyword"
},
"tags":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
}
}
}
}
Sample Documents:
POST my_stem_index/_doc/17
{
"id": 17,
"tags": ["it", "devops", "server"]
}
POST my_stem_index/_doc/20
{
"id": 20,
"tags": ["software", "hardware"]
}
POST my_stem_index/_doc/21
{
"id": 21,
"tags": ["softwares and applications", "hardwares and storage devices"]
}
Request Query:
POST my_stem_index/_search
{
"query": {
"match": {
"tags": "software"
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.5908618,
"hits" : [
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.5908618,
"_source" : {
"id" : 20,
"tags" : [
"software",
"hardware"
]
}
},
{
"_index" : "my_stem_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.35965496,
"_source" : {
"id" : 21,
"tags" : [
"softwares and applications", <--- Note that `softwares` was matched as well.
"hardwares and storage devices"
]
}
}
]
}
}
Notice that both documents, i.e. those with _id 20 and 21, appear in the response.
Additional Note:
If you are new to Elasticsearch, I'd suggest spending some time on the concept of Analysis and how Elasticsearch implements it using Analyzers.
This will help you understand why the document with softwares and applications is returned when you query only for software, and vice versa.
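To make the effect of analysis concrete, here is a toy sketch in plain Python. The function `toy_stem` is a deliberately crude stand-in (lowercase, then strip one trailing "s") and is not the algorithm the english stemmer actually uses. The names `analyze`, `matching_ids` and `DOCS` are likewise made up for illustration. The point is only that text and tags are reduced to the same terms before they are compared.

```python
import re


def toy_stem(token):
    """Crude stand-in for a real stemmer: lowercase and drop one trailing 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Split on non-alphanumeric characters and stem every token."""
    return {toy_stem(token) for token in re.findall(r"[A-Za-z0-9]+", text)}


def matching_ids(text, docs):
    """Return ids of documents whose analyzed tags share a term with the text."""
    query_terms = analyze(text)
    hits = []
    for doc in docs:
        doc_terms = set()
        for tag in doc["tags"]:
            doc_terms |= analyze(tag)
        if query_terms & doc_terms:
            hits.append(doc["id"])
    return hits


DOCS = [
    {"id": 17, "tags": ["it", "devops", "server"]},
    {"id": 20, "tags": ["software", "hardware"]},
]

print(matching_ids("I'm using some softwares and applications", DOCS))  # [20]
```

Because "softwares" and "software" collapse to the same term, document 20 matches even though the surface forms differ.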
Hope this helps!
If you are searching for text that shares a base or root word, Stemming is a good approach.
If you need to find the most similar word(s) in a text, Ngram is more suitable.
If you are searching for exact words of the text within the words of the tags, Shingles is the better approach.

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc@gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND the test with id 587 has a score > 5, and so on. At a high level, what I need looks something like the query below. I don't know what the actual query would be, so apologies for the messy pseudo-query:
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible, so that we can pass in dynamic test ids and their respective score filters, along with filters on the non-nested fields such as age, name and status.
Something like this?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You should use an Ngram Tokenizer instead: wildcard queries are slow, and I wouldn't recommend them.
Change your mapping to the one below, in which I've defined a custom analyzer.
Elasticsearch (or rather Lucene) indexes a statement by first breaking the statement or paragraph into words, or tokens, and then adding those tokens to the inverted index for that particular field. This process is called Analysis, and it applies only to the text datatype.
At search time, you only get a document back if the query's tokens are present in the inverted index.
By default, the standard analyzer is applied. I've created my own analyzer that uses an Ngram Tokenizer, which produces many more tokens than just the whole words.
The default analyzer would turn Life is beautiful into life, is, beautiful.
With ngrams (min_gram 3, max_gram 4), however, the tokens for the word life would be lif, life and ife.
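The token expansion described above can be sketched in a few lines of Python. This is only an approximation of the ngram tokenizer with `token_chars` set to letter and digit: the function name `ngram_tokens` is made up, and the regular expression only covers ASCII letters and digits. Like the tokenizer on its own, the sketch does not lowercase anything.

```python
import re


def ngram_tokens(text, min_gram=3, max_gram=4):
    """Approximate an ngram tokenizer restricted to letters and digits."""
    tokens = []
    # Any character that is not an ASCII letter or digit acts as a separator.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens


print(ngram_tokens("life"))              # ['lif', 'life', 'ife']
print("war" in ngram_tokens("Schwarb"))  # True
```

This is why a plain match query for "war" can find "Schwarb": the term "war" is one of the tokens stored in the inverted index for that name.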
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've added a keyword sub-field to name, email and status, as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would use Term Queries on the keyword fields, while for normal searches (LIKE in SQL) I would use simple Match Queries on the text fields, provided they use the Ngram Tokenizer.
Also note that for >= and <= you need a Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc@gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that the document from your question appears in the response when I run this query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!
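Since the question asks for dynamic test ids with their own score filters, one possible sketch is to generate the request body in code. The helper `student_query` and its arguments are hypothetical. It assumes the ngram mapping above, so that plain match queries work for the name and email parts, and it follows the `gte` comparison used in the query above.

```python
import json


def student_query(name, email, test_filters):
    """Build a bool query with one nested clause per (test_id, min_score) pair."""
    must = [
        {"match": {"name": name}},
        {"match": {"email": email}},
    ]
    for test_id, min_score in test_filters:
        # Each pair gets its own nested clause, so the id and the score
        # are checked against the same nested object.
        must.append({
            "nested": {
                "path": "tests",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"tests.test_id": test_id}},
                            {"range": {"tests.test_score": {"gte": min_score}}},
                        ]
                    }
                },
            }
        })
    return {"query": {"bool": {"must": must}}}


body = student_query("war", "gmail", [(587, 5), (588, 6)])
print(json.dumps(body, indent=2))
```

Putting both test conditions into a single nested clause would require one nested object to have both ids at once, which is why the sketch emits a separate clause per test.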

Elasticsearch query_string filter with Fields when not empty string

I'm trying to build a query_string query with the Elasticsearch DSL. In SQL style, my query looks like this:
SELECT NAME, DESCRIPTION, URL, FACEBOOK_URL, YEAR_CREATION FROM MY_INDEX WHERE FACEBOOK_URL <> '' AND (MATCH('NAME: sometext OR DESCRIPTION: sometext')) AND YEAR_CREATION > 2000
I don't know how to include a filter for a non-empty value of FACEBOOK_URL.
Thanks for your help...
@Kamal's point is very clear. Check the type of your "FACEBOOK" field: it must be keyword, not text.
Please see the below mapping, sample documents, the request query and response.
Note that I may not have added all the fields but only the concerned fields so as to mirror the query you've added.
Mapping:
PUT facebook
{
"mappings": {
"properties": {
"name":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"description":{
"type": "text",
"fields": {
"keyword":{
"type":"keyword"
}
}
},
"facebook_url":{
"type": "keyword"
},
"year_creation":{
"type": "date"
}
}
}
}
Sample Docs:
Of the 4 documents below, only the 3rd is one that you would want returned.
Docs 1 and 2 have empty values for facebook_url, while doc 4 does not have the field at all.
POST facebook/_doc/1
{
"name": "sometext",
"description": "sometext",
"facebook_url": "",
"year_creation": "2019-01-01"
}
POST facebook/_doc/2
{
"name": "sometext",
"description": "sometext",
"facebook_url": "",
"year_creation": "2019-01-01"
}
POST facebook/_doc/3
{
"name" : "sometext",
"description" : "sometext",
"facebook_url" : "http://mytest.fb.link",
"year_creation" : "2019-01-01"
}
POST facebook/_doc/4
{
"name": "sometext",
"description": "sometext",
"year_creation": "2019-01-01"
}
Request Query:
POST facebook/_search
{
"_source": ["name", "description","facebook_url","year_creation"],
"query": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"match": {
"name": "sometext"
}
},
{
"match": {
"description": "sometext"
}
}
]
}
},
{
"exists": {
"field": "facebook_url"
}
},
{
"range": {
"year_creation": {
"gte": "2000-01-01"
}
}
}
],
"must_not": [
{
"term": {
"facebook_url": {
"value": ""
}
}
}
]
}
}
}
I think the query is self-explanatory.
I have added an Exists query so that documents without the field do not appear in the result; for empty values, I've added a clause in must_not.
Notice that in my design facebook_url is a keyword field, as it makes no sense for it to be text. For that reason, I've used a Term Query.
Also note that for date filtering I've used a Range Query. Do go through the links for more clarification, as it is important to understand how each of these queries works.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.148216,
"hits" : [
{
"_index" : "facebook",
"_type" : "_doc",
"_id" : "3",
"_score" : 2.148216,
"_source" : {
"facebook_url" : "http://mytest.fb.link",
"year_creation" : "2019-01-01",
"name" : "sometext",
"description" : "sometext"
}
}
]
}
}
Updated Answer:
Change the ANNEE_CREATION field from integer to date, as that is the correct type for date values.
The query in your question does not apply a range query on the date field.
Note that the must_not logic should be applied to the keyword field of FACEBOOK, not to the text field.
{
"query":{
"bool":{
"must":[
{
"query_string":{
"query":" Bordeaux",
"fields":[
"VILLE",
"ADRESSE",
"FACEBOOK"
]
}
},
{
"exists":{
"field":"FACEBOOK"
}
}
],
"must_not":[
{
"term":{
"FACEBOOK.keyword":{ <------ Make sure this is a keyword field
"value":""
}
}
}
],
"filter":[
{
"range":{
"FONDS_LEVEES_TOTAL":{
"gt":0
}
}
},
{
"range":{ <----- Apply the range query here based on what you've mentioned in question
"ANNEE_CREATION":{ <----- Make sure this is the date field
"gte": "2015" <----- Make sure you apply correct query parameter in range query
}
}
}
]
}
},
"track_total_hits":true,
"from":0,
"size":8,
"_source":[
"FACEBOOK",
"NOM",
"ANNEE_CREATION",
"FONDS_LEVEES_TOTAL"
]
}
As expected, only the document with id 3 is returned.
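The way the exists clause and the must_not clause combine can be checked against the four sample documents with a small plain-Python restatement. This is only an illustration of the logic, not how Elasticsearch evaluates the query: the names `DOCS` and `keep` are made up, and the substring test is a simplification of the match queries.

```python
from datetime import date

BASE = {"name": "sometext", "description": "sometext", "year_creation": "2019-01-01"}

# The four sample documents, keyed by their ids.
DOCS = {
    1: {**BASE, "facebook_url": ""},
    2: {**BASE, "facebook_url": ""},
    3: {**BASE, "facebook_url": "http://mytest.fb.link"},
    4: dict(BASE),  # no facebook_url field at all
}


def keep(doc, term="sometext", since=date(2000, 1, 1)):
    """Mirror the bool query: should-match, exists, must_not empty, date range."""
    text_match = term in doc.get("name", "") or term in doc.get("description", "")
    has_field = "facebook_url" in doc          # the exists clause
    not_empty = doc.get("facebook_url") != ""  # the must_not term clause
    recent = date.fromisoformat(doc["year_creation"]) >= since
    return text_match and has_field and not_empty and recent


print([doc_id for doc_id, doc in DOCS.items() if keep(doc)])  # [3]
```

Docs 1 and 2 pass the exists check but fail the empty-value check, while doc 4 fails the exists check, so both clauses are needed.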

Count total number of words of all documents pointing to specific fields

Someone asked this question before, but no one seems to have answered it or suggested possible ways to solve it: https://discuss.elastic.co/t/count-the-number-of-words-in-the-field-elastic-search-6-2/121373
I'm trying to produce a report from Elasticsearch that counts the number of words/tokens in two specific fields, title and content.
Is there a proper aggregation for this?
For example, I have this query:
GET web/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"fields":[
"title",
"content"
],
"query":"((\"Hello\") AND (\"World\"))"
}
},
{
"range":{
"pub_date":{
"from":1569456000,
"to":1570060800
}
}
}
]
}
}
}
Say this query matches 23 documents. I want a response that tells me how many words those 23 documents contain, based on the title and content fields.
I would leverage the token_count data type. In your index, you can add a sub-field of type token_count to your title and content fields, like this:
PUT web
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Then, in order to find out the number of tokens, you can simply run a sum aggregation on the .length sub-field, like this:
POST web/_search
{
"size": 0,
"aggs": {
"title_tokens": {
"sum": {
"field": "title.length"
}
},
"content_tokens": {
"sum": {
"field": "content.length"
}
}
}
}
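To restrict the count to the documents matched by a query (the 23 documents in the question), the aggregation can be sent together with that query in one request. The sketch below assembles such a body. The helper name `word_count_request` is hypothetical, the `.length` sub-fields are the ones defined in the mapping above, and the range clause is written with `gte`/`lte` as an assumption about how the original `from`/`to` bounds would be expressed.

```python
import json


def word_count_request(query, fields=("title", "content")):
    """Combine a query with one sum aggregation per token_count sub-field."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "query": query,
        "aggs": {
            f"{field}_tokens": {"sum": {"field": f"{field}.length"}}
            for field in fields
        },
    }


query = {
    "bool": {
        "must": [
            {"query_string": {"fields": ["title", "content"],
                              "query": "(\"Hello\") AND (\"World\")"}},
            {"range": {"pub_date": {"gte": 1569456000, "lte": 1570060800}}},
        ]
    }
}

body = word_count_request(query)
print(json.dumps(body, indent=2))
```

Aggregations run over the documents selected by the query, so the two sums would then cover only the matching documents.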
I am using a data type called token_count. It calculates and stores the number of tokens for each text value, and that stored count can then be used to get the token count of the fields.
PUT index18
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "edJPtW0BVHM68p7X-Wlu",
"_score" : 1.0,
"_source" : {
"title" : "Mayor Isko"
}
},
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "etJQtW0BVHM68p7XGmmr",
"_score" : 1.0,
"_source" : {
"title" : "Isko"
}
}
]
Query
GET index18/_search
{
"query": {"match_all": {}},
"aggs": {
"WordCount": {
"sum": {
"field": "title.length"
}
}
}
}

Elastic search: How to highlight the fragment after the search term?

I am working on a search project that requires the highlight fragment to cover the text after the search word.
My query is
{
"query": {
"multi_match" : {
"query" : "prawn",
"fields": ["name"],
"operator": "and",
"use_dis_max": true
}
},
"_source": ["name"],
"highlight": {
"fields": {
"name": {
"pre_tags" : [""], "post_tags" : [""],
"fragment_size": 3,
"number_of_fragments": 1
}
}
}
}
Result:
{
"name" : "special prawn curry"
},
"highlight" : {
"name" : [
"special prawn"
]
}
Whereas, I want the result like
"name" : "special prawn curry"
},
"highlight" : {
"name" : [
"prawn curry"
]
}
i.e. the fragment after the search word. Is it possible?
Well, you can use the Plain highlighter (by setting "type": "plain") in the highlight section and see if that works out.
This used to be the default highlighter until the 6.0 release, which made Unified the default highlighter.
POST <your_index_name>/_search
{
"query": {
"multi_match" : {
"query" : "prawn",
"fields": ["name"],
"operator": "and",
"use_dis_max": true
}
},
"_source": ["name"],
"highlight": {
"fields": {
"name": {
"type": "plain", <---- Added this
"pre_tags" : [""], "post_tags" : [""],
"fragment_size": 3,
"number_of_fragments": 1,
"order": "score"
}
}
}
}
Hope this helps!
