How to find all Nike products with "nikeeeee" keyword - elasticsearch

I have an Elasticsearch db with 15 million products.
I want to write a query to find all Nike products when my user search "nikeeeee" or "nikeoff" or "bestnike" or "nikenike" or "nike-Nike" or some keywords like these.
When I used Fazzy query, The result returned was not relevant.
How can i handle it?
Thanks in advance
To find all Nike products when my user search "nikeeeee"

As mentioned by #rabbitbr you can use synonymns
Using synonyms multiple words can point to same token.
Below is working example of it
PUT <index-name>
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [
"nikeeeee,nikeoff,bestnike,nikenike,nike-nike => nike"
]
}
}
}
}
}
}
POST <index-name>/_doc
{
"title":"adidas shoes"
}
GET <index-name>/_search
{
"query": {
"match": {
"title": "bestnike"
}
}
}
Result
"hits" : [
{
"_index" : "index50",
"_type" : "_doc",
"_id" : "ky9VF4QBpvliSuG-OTh-",
"_score" : 0.6931471,
"_source" : {
"title" : "nike shoes"
}
}
]

Related

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND test with id 587 have score > 5 etc. The high level of what is needed can be put something like below, dont know what would be the actual query, apologize for this messy query below
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can enter dynamic test Ids and their respective score filters along with the fields out of nested fields like age, name, status
Something like that?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You must make use of Ngram Tokenizer as wildcard search must not be used for performance reasons and I wouldn't recommend using it.
Change your mapping to the below where you can create your own Analyzer which I've done in the below mapping.
How elasticsearch (albiet lucene) indexes a statement is, first it breaks the statement or paragraph into words or tokens, then indexes these words in the inverted index for that particular field. This process is called Analysis and that this would only be applicable on text datatype.
So now you only get the documents if these tokens are available in inverted index.
By default, standard analyzer would be applied. What I've done is I've created my own analyzer and used Ngram Tokenizer which would be creating many more tokens than just simply words.
Default Analyzer on Life is beautiful would be life, is, beautiful.
However using Ngrams, the tokens for Life would be lif, ife & life
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of Term Queries on keyword fields while for normal searches or LIKE in SQL I would make use of simple Match Queries on text Fields provided they make use of Ngram Tokenizer.
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that I observe the document you've mentioned in your question, in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

Count total number of words of all documents pointing to specific fields

Someone asked this question but no one seems to answer or tried to suggest possible ways to solve it: https://discuss.elastic.co/t/count-the-number-of-words-in-the-field-elastic-search-6-2/121373
Now, I'm trying to produce a report from Elasticsearch to count the number of WORDS / TOKENS from a specific field called title and content
Is there a proper aggregation for this?
For example, I have this query:
GET web/_search
{
"query":{
"bool":{
"must":[
{
"query_string":{
"fields":[
"title",
"content"
],
"query":"((\"Hello\") AND (\"World\")"
}
},
{
"range":{
"pub_date":{
"from":1569456000,
"to":1570060800
}
}
}
]
}
}
}
And for example, this query produced 23 DOCUMENTS, I want to make a response telling me how MANY words do those 23 documents contain based from the title and content fields?
I would leverage the token_count data type. In your index, you can add a sub-field of type token_count to your title and content fields, like this:
PUT web
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Then, in order to find out the number of tokens, you can simply run a sum aggregation on the .length sub-field, like this:
POST web/_search
{
"size": 0,
"aggs": {
"title_tokens": {
"sum": {
"field": "title.length"
}
},
"content_tokens": {
"sum": {
"field": "content.length"
}
}
}
}
I am using data type called token_count It will calculate and store the count of tokens for each text. This count value can be utilized to get the token count of fields
PUT index18
{
"mappings": {
"properties": {
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
},
"content": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
},
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "edJPtW0BVHM68p7X-Wlu",
"_score" : 1.0,
"_source" : {
"title" : "Mayor Isko"
}
},
{
"_index" : "index18",
"_type" : "_doc",
"_id" : "etJQtW0BVHM68p7XGmmr",
"_score" : 1.0,
"_source" : {
"title" : "Isko"
}
}
]
Query
GET index18/_search
{
"query": {"match_all": {}},
"aggs": {
"WordCount": {
"sum": {
"field": "title.length"
}
}
}
}

ElasticSearch "more like this" returning empty result

I made a very simple test to figure out my mistake, but did not find it. I created two indexes and I'm trying to search documents in the ppa index that are similar to a given document in the ods index (like the second example here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html).
These are my settings, mappings and documents for the ppa index:
PUT /ppa
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ppa/_mapping/ppa
{"properties": {"descricao": {"type": "text", "analyzer": "brazilian"}}}
POST /_bulk
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
{"index":{"_index":"ppa","_type":"ppa"}}
{"descricao": "erradicar a pobreza"}
These are my settings, mappings and documents for the ods index:
PUT /ods
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_stemmer"
]
}
}
}
}
}
PUT /ods/_mapping/ods
{"properties": {"metaodsdescricao": {"type": "text", "analyzer": "brazilian"},"metaodsid": {"type": "integer"}}}
POST /_bulk
{"index":{"_index":"ods","_type":"ods", "_id" : "1" }}
{ "metaodsdescricao": "erradicar a pobreza","metaodsid": 1}
{"index":{"_index":"ods","_type":"ods", "_id" : "2" }}
{"metaodsdescricao": "crianças que vivem na pobreza", "metaodsid": 2}
Now, this search doesn't work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : [
{
"_index" : "ods",
"_type" : "ods",
"_id" : "1"
}
],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
But this one does work:
GET /ppa/ppa/_search
{
"query": {
"more_like_this" : {
"fields" : ["descricao"],
"like" : ["erradicar a pobreza"],
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 20
}
}
}
What is happening?
Please, help me make this return something other than empty.
The "more like this" query work well when you have indexed a lot of data. The empty result can be symptom of very few documents present in the elastic index.

How Elasticsearch relevance score gets calculated?

I am using multi_match with phrase_prefix for full text search in Elasticsearch 5.5. ES query looks like
{
query: {
bool: {
must: {
multi_match: {
query: "butt",
type: "phrase_prefix",
fields: ["item.name", "item.keywords"],
max_expansions: 10
}
}
}
}
}
I am getting following response
[
{
"_index": "items_index",
"_type": "item",
"_id": "2",
"_score": 0.61426216,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk, flavoured",
"name": "Flavoured Butter"
}
}
},
{
"_index": "items_index",
"_type": "item",
"_id": "1",
"_score": 0.39063013,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk",
"name": "Butter Milk"
}
}
}
]
Mappings is as follows(I am using default mappings)
{
"items_index" : {
"mappings" : {
"parent_doc": {
...
"properties": {
"item" : {
"properties" : {
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
How item with "name": "Flavoured Butter" getting higher score of 0.61426216 against the document with "name": "Butter Milk" and score 0.39063013?
I tried applying boost to "item.name" and removing "item.keywords" form search fields getting same results.
How scores in Elasticsearch works? Are above results correct in terms of relavance?
The scoring for phrase_prefix is similar to that of best_fields, meaning that score of a document is the score obtained from the best_field, which here is item.keywords.
So, item.name isn't adding to score
Refer: multi-match-types
You can use 2 multi_match queries to combine the score from keywords and name.
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.keywords"
],
"max_expansions": 10
}
},{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.name"
],
"max_expansions": 10
}
}]
}
}
}

Analyzers in ElasticSearch not working

I am using ElasticSearch to store the Tweets I receive from the Twitter Streaming API. Before storing them I'd like to apply an english stemmer to the Tweet content, and to do that I'm trying to use ElasticSearch analyzers with no luck.
This is the current template I am using:
PUT _template/twitter
{
"template": "139*",
"settings" : {
"index":{
"analysis":{
"analyzer":{
"english":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase", "en_stemmer", "stop_english", "asciifolding"]
}
},
"filter":{
"stop_english":{
"type":"stop",
"stopwords":["_english_"]
},
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
}
},
"mappings": {
"tweet": {
"_timestamp": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"_index": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"properties": {
"geo": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"text": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
When I start the Streaming and the index is created, all the mappings I've defined seem to apply correctly, but the text is stored as it comes from Twitter, completely raw. The index metadata shows:
"settings" : {
"index" : {
"uuid" : "xIOkEcoySAeZORr7pJeTNg",
"analysis" : {
"filter" : {
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
},
"stop_english" : {
"type" : "stop",
"stopwords" : [
"_english_"
]
}
},
"analyzer" : {
"english" : {
"type" : "custom",
"filter" : [
"lowercase",
"en_stemmer",
"stop_english",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "1010099"
}
}
},
"mappings" : {
"tweet" : {
[...]
"text" : {
"analyzer" : "english",
"type" : "string"
},
[...]
}
}
What am I doing wrong? The analyzers seems to be applied correctly, but nothing is happening :/
Thank you!
PS: The search query I use to realize the analyzer is not being applied:
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "_index:1397574496990"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"match_all": {}
},
{
"exists": {
"field": "geo.coordinates"
}
}
]
}
}
}
},
"fields": [
"geo.coordinates",
"text"
],
"size": 50000
}'
This should return the stemmed text as one of the fields, but the response is:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 47,
"successful": 47,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.97402453,
"hits": [
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086643423068161",
"_score": 0.97402453,
"fields": {
"geo.coordinates": [
-118.21122533,
33.79349318
],
"text": [
"Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone 🌊🐢🐢🐢☀️#turtles… http://t.co/wAVmcxnf76"
]
}
},
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086701451259904",
"_score": 0.97333175,
"fields": {
"geo.coordinates": [
-81.017636,
33.998741
],
"text": [
"Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
]
}
}
]
}
}
The text field is exactly the same that came from Twitter (I'm using the streaming API). What I expect is the text fields stemmed, as the analyzer is applied.
Analyzers don't affect the way data is stored. So, no matter which analyzer you are using you will get the same text back from source and stored fields. Analyzer are applied when you search. So by searching for something like text:twin and finding records with the word Twins, you will know that stemmer was applied.

Resources