Multiple words act as a single word in search - Elasticsearch

I have an issue with tags such as social media, two words, or tag with many spaces: the score is multiplied for each word in the search query.
How can I make a multi-word tag be searched as a single word, instead of getting a different score depending on whether I search for two or for two words?
Here is a visual representation of the current scores:
+-----------------------+-------+
| search                | score |
+-----------------------+-------+
| two                   | 2.76  |
| two words             | 5.53  |
| tag with many spaces  | 11.05 |
| singleword            | 2.76  |
+-----------------------+-------+
Here is a visual representation of what I want:
+-----------------------+-------+
| search                | score |
+-----------------------+-------+
| two                   | 2.76  |
| two words             | 2.76  |
| tag with many spaces  | 2.76  |
| singleword            | 2.76  |
+-----------------------+-------+
Each document has multiple tags. In PHP, the search input is split on commas into individual tags, and each tag becomes one clause of the query below.
Assuming a document has multiple tags, including two words and singleword, this would be the search query:
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"match": {
"tags.name": "two words"
}
},
{
"match": {
"tags.name": "singleword"
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": "tags.votes"
}
}
],
"boost_mode": "multiply"
}
}
The score differs when searching for two instead of two words.
Here is what the result looks like when searching for two words:
{
"_index": "index",
"_type": "type",
"_id": "u10q42cCZsbFNf1W0Tdq",
"_score": 4.708793,
"_source": {
"url": "example.com",
"title": "title of the document",
"description": "some description of the document",
"popularity": 9,
"tags": [
{
"name": "two words",
"votes": 1
},
{
"name": "singleword",
"votes": 1
},
{
"name": "othertag",
"votes": 1
},
{
"name": "random",
"votes": 1
}
]
}
}
Here is the result when searching for two instead of two words:
{
"_index": "index",
"_type": "type",
"_id": "u10q42cCZsbFNf1W0Tdq",
"_score": 3.4481666,
"_source": {
"url": "example.com",
"title": "title of the document",
"description": "some description of the document",
"popularity": 9,
"tags": [
{
"name": "two words",
"votes": 1
},
{
"name": "singleword",
"votes": 1
},
{
"name": "othertag",
"votes": 1
},
{
"name": "random",
"votes": 1
}
]
}
}
Here is the mapping (for the tags specifically)
"tags": {
"type": "nested",
"include_in_parent": true,
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"votes": {
"type": "long"
}
}
}
I have tried searching with "\"two words\"" and "*two words*", but it made no difference.
Is it possible to achieve this?

You should match against the non-analyzed string (the keyword sub-field) and switch to a term query.
Can you try:
"query": {
"function_score": {
"query": {
"bool": {
"should": [
{
"term": {
"tags.name.keyword": "two words"
}
},
{
"term": {
"tags.name.keyword": "singleword"
}
}
]
}
},
"functions": [
{
"field_value_factor": {
"field": "tags.votes"
}
}
],
"boost_mode": "multiply"
}
}
With your current implementation, a match query for "two words" is analyzed into the tokens "two" and "words", and each token is searched for in your tags. A document tagged "two words" therefore matches both tokens, and its score increases with each one.
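The question mentions that the tag list is split on commas in PHP before the query is built. As a minimal sketch of that step (written in Python here, with the hypothetical helper name build_tag_query), each tag can be emitted as one term clause on tags.name.keyword, so a multi-word tag is never split into separate tokens:

```python
import json


def build_tag_query(raw_tags: str) -> dict:
    """Build the function_score body from a comma-separated tag string.

    build_tag_query is a hypothetical helper, not part of any client library.
    """
    # Split on commas only, so "two words" stays a single tag.
    tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
    # One term clause per tag, against the non-analyzed keyword sub-field.
    should = [{"term": {"tags.name.keyword": tag}} for tag in tags]
    return {
        "query": {
            "function_score": {
                "query": {"bool": {"should": should}},
                "functions": [{"field_value_factor": {"field": "tags.votes"}}],
                "boost_mode": "multiply",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_tag_query("two words, singleword"), indent=2))
```

Note that a term query on a keyword field is case-sensitive, so the tags must be sent exactly as they were indexed.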

Related

Elasticsearch score from 0 to 1 for searching similar documents to the one that exists

I need to calculate a relative score from 0 to 1 when searching for documents similar to an existing one.
The existing document has score 1, and every other matching document should be scored relative to it, so its score is <= 1. The existing document itself should be excluded from the search. Is it possible to do this on the Elasticsearch side, rather than calculating the score manually in a programming language, like:
match_doc_score/search_doc_score
Let's imagine we have index person with mapping:
{
"properties": {
"person_id": {
"type": "keyword"
},
"fullname": {
"type": "text"
},
"email": {
"type": "keyword"
},
"phone": {
"type": "keyword"
},
"country_of_birth": {
"type": "keyword"
}
}
}
And I have 3 persons inside the index:
Person 1:
{
"person_id": 1,
"fullname": "John Snow",
"email": "john#gmail.com",
"phone": "111-11-11",
"country_of_birth": "Denmark"
}
Person 2:
{
"person_id": 2,
"fullname": "Snow John",
"email": "john#gmail.com",
"phone": "222-22-22",
"country_of_birth": "Denmark"
}
Person 3:
{
"person_id": 3,
"fullname": "Peter Wislow",
"email": "peter#gmail.com",
"phone": "111-11-11",
"country_of_birth": "Denmark"
}
We find persons that are similar to Person 1 by this query:
{
"query": {
"bool": {
"should": [
{
"match": {
"fullname": {
"query": "John Snow",
"boost": 6
}
}
},
{
"term": {
"email": {
"value": "john#gmail.com",
"boost": 5
}
}
},
{
"term": {
"phone": {
"value": "111-11-11",
"boost": 4
}
}
},
{
"term": {
"country_of_birth": {
"value": "Denmark",
"boost": 2
}
}
}
],
"must_not": [
{
"term": {
"person_id": 1
}
}
]
}
}
}
As you can see:
person 1 and person 2 match by: fullname, email, country of birth.
person 1 and person 3 match by: phone, country of birth.
Is it possible to get 0..1 scoring, given that the fully matching document (person 1) is in the index?
I know there is a more_like_this query, but in real life search queries can be complicated, so more_like_this is not a good option. Even the Elasticsearch documentation says that if you need more control over the query, you should use boolean query combinations.
I have not tried it, but it looks like the field_value_factor function of function_score might solve your problem.
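For comparison, the manual calculation mentioned in the question (match_doc_score/search_doc_score) can be sketched on the client side as follows. This is only an illustration, not an Elasticsearch-side solution: normalize_scores is a hypothetical helper, and it assumes the reference score (the score person 1 would get against its own query) has been obtained separately, for example by running the same query without the must_not clause.

```python
def normalize_scores(hits, reference_score):
    """Scale each hit's _score into 0..1 relative to reference_score.

    hits is a list of dicts shaped like Elasticsearch hits ("_id", "_score").
    reference_score is assumed to be the score of the fully matching document.
    """
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return [
        {
            "_id": hit["_id"],
            # Clamp to 1.0 in case a hit scores higher than the reference.
            "relative_score": min(hit["_score"] / reference_score, 1.0),
        }
        for hit in hits
    ]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    sample_hits = [{"_id": "2", "_score": 4.0}, {"_id": "3", "_score": 2.0}]
    print(normalize_scores(sample_hits, reference_score=8.0))
```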

Using filters to count values in Kibana / Visualize?

(I am quite new to the ELK stack and may be asking something obvious...)
I have documents describing customer information, with data such as name, address, age, etc.
Sometimes not all of these fields exist, and I would like to know the number of documents in which they are filled.
If the data looks like:
PUT customers
{
"mappings": {
"customer": {
"properties": {
"id": {
"type": "integer"
},
"category": {
"type": "keyword"
},
"email": {
"type": "text"
},
"age": {
"type": "integer"
},
"address": {
"type": "text"
}
}
}
}
}
POST _bulk
{"index":{"_index":"customers","_type":"customer"}}
{"id":"1","category":"aa","email":"sam#test.com"}
{"index":{"_index":"customers","_type":"customer"}}
{"id": "2", "category" : "aa", "age": "5"}
{"index":{"_index":"customers","_type":"customer"}}
{"id": "3", "category" : "aa", "email": "bob#test.com", "age": "36"}
{"index":{"_index":"customers","_type":"customer"}}
{"id": "4", "category" : "bb", "email": "kim#test.com", "age": "42", "address": "london"}
The idea is to have a data table in Kibana Visualize like:
+----------+-------+-------+-----+---------+
| category | total | email | age | address |
+----------+-------+-------+-----+---------+
| aa | 3 | 2 | 2 | 0 |
| bb | 1 | 1 | 1 | 1 |
+----------+-------+-------+-----+---------+
(e.g. we have 3 customers in category "aa"; among them 2 gave their email, 2 gave their age, and none gave their address)
I can figure out how to do that with a query like:
POST /customers/_search?size=0
{
"aggs": {
"category": {
"terms": {
"field": "category"
},
"aggs": {
"count_email": {
"filter": {
"exists": {
"field": "email"
}
}
},
"count_age": {
"filter": {
"exists": {
"field": "age"
}
}
},
"count_address": {
"filter": {
"exists": {
"field": "address"
}
}
}
}
}
}
}
But I can't find how to do that in Kibana Visualize.
Should I use scripted fields? JSON input? How? Is there a better way?
Thanks for your advice.
In the UI, I was able to split the rows using a terms aggregation on the category keyword field.
Below is a URL to get you started.
It creates a data table, aggregates by count, and splits the rows by the category keyword term.
http://localhost:5601/app/kibana#/visualize/create?type=table&indexPattern=customers&_g=()&_a=(filters:!(),linked:!f,query:(query_string:(analyze_wildcard:!t,query:'*')),uiState:(vis:(params:(sort:(columnIndex:!n,direction:!n)))),vis:(aggs:!((enabled:!t,id:'1',params:(),schema:metric,type:count),(enabled:!t,id:'2',params:(field:category.keyword,order:desc,orderBy:_term,size:2),schema:bucket,type:terms)),listeners:(),params:(perPage:10,showMeticsAtAllLevels:!f,showPartialRows:!f,showTotal:!f,sort:(columnIndex:!n,direction:!n),totalFunc:sum),title:'CategoryTable',type:table))
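If the per-field counts end up being computed outside Kibana, the aggregation body from the question can be generated for any list of fields. A minimal sketch, assuming a hypothetical helper named build_exists_counts; it only builds the request body and does not talk to Elasticsearch:

```python
import json


def build_exists_counts(group_field, counted_fields):
    """Build a terms aggregation with one exists-filter sub-aggregation per field.

    build_exists_counts is a hypothetical helper; the body it returns mirrors
    the hand-written query in the question.
    """
    sub_aggs = {
        f"count_{field}": {"filter": {"exists": {"field": field}}}
        for field in counted_fields
    }
    return {
        "aggs": {
            group_field: {
                "terms": {"field": group_field},
                "aggs": sub_aggs,
            }
        }
    }


if __name__ == "__main__":
    body = build_exists_counts("category", ["email", "age", "address"])
    print(json.dumps(body, indent=2))
```

In the response, each category bucket's doc_count gives the total column, and each count_* sub-aggregation's doc_count gives the per-field column.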

Elasticsearch OR filtered query does not return results

I have the following data set:
{
"_index": "myIndex",
"_type": "myType",
"_id": "220005",
"_score": 1,
"_source": {
"id": "220005",
"name": "Some Name",
"type": "myDataType",
"doc_as_upsert": true
}
}
Doing a direct match query like so:
GET typo3data/destination/_search
{
"query": {
"match": {
"name": "Some Name"
}
},
"size": 500
}
Will return the data just fine:
"hits": {
"total": 1,
"max_score": 3.442347,
"hits": [...
Doing an OR query, however (I am not sure which syntax is correct; the first is taken from the Elasticsearch docs, the second is a working query from another project running the same versions):
GET typo3data/destination/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"or": {
"filters": [
{
"term": {
"name": "Some Name"
}
}
]
}
}
}
},
"size": 500
}
or
{
"query":
{
"match_all": {}
},
"filter":
{
"or":
[
{ "term": { "name": "Some Name"} },
{ "term": { "name": "Some Other Name"} }
]
},
"size": 1000
}
Does not return anything.
The mapping for the name field is:
"name": {
"type": "string",
"index": "not_analyzed"
}
Elasticsearch version is 1.4.4.
When "some name" is indexed into an analyzed field, it is broken into tokens as follows:
"some name" => [ "some" , "name" ]
A normal match query applies the same analysis to the query text before matching. If either "some" or "name" is present, the document qualifies as a result:
match query ("some name") => search for term "some" or "name"
The term query does not analyze or tokenize your query. It looks for the exact token "some name", which is not present in the index.
term query ("some name") => search for term "some name"
Hence you won't see any results.
Things should work fine if you make the field not_analyzed, but then make sure the case also matches.
You can read more about the same here.
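The difference between the two query types can be illustrated with a toy model. This is only a sketch: toy_analyze is a rough stand-in (lowercase, split on whitespace) for what an analyzer does, not the Elasticsearch implementation, and all function names are hypothetical.

```python
def toy_analyze(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def index_field(value, analyzed):
    """Return the set of terms stored for a field value."""
    # An analyzed field stores tokens; a not_analyzed field stores the raw value.
    return set(toy_analyze(value)) if analyzed else {value}


def term_query(indexed_terms, query):
    """term query: the query text is looked up as-is, without analysis."""
    return query in indexed_terms


def match_query(indexed_terms, query):
    """match query against an analyzed field: analyze the query, match any token."""
    return any(token in indexed_terms for token in toy_analyze(query))


if __name__ == "__main__":
    analyzed = index_field("Some Name", analyzed=True)
    raw = index_field("Some Name", analyzed=False)
    print(term_query(analyzed, "Some Name"))   # exact term is not among the tokens
    print(match_query(analyzed, "Some Name"))  # tokens "some" and "name" are found
    print(term_query(raw, "Some Name"))        # raw value matches exactly
    print(term_query(raw, "some name"))        # case differs, so no match
```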
After extending our mapping to include every field we have:
PUT typo3data/_mapping/destination
{
"someType": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"parentId": {
"type": "integer"
},
"type": {
"type": "string"
},
"generatedUid": {
"type": "integer"
}
}
}
}
With that mapping in place, the or filters worked. So the general answer is: if you have such a problem, check your mappings closely, and rather do too much work on them than too little.
If someone can explain why this happens, I will gladly pass the accepted-answer mark on to them.

How to get the number of hits of several matching fields in one record?

I have records similar to
{
"who": "John",
"hobby": [
{"name": "gardening",
"skills": 2
},
{"name": "sleeping",
"skills": 3
},
{"name": "darts",
"skills": 2
}
]
}
,
{
"who": "Mary",
"hobby": [
{"name": "gardening",
"skills": 2
},
{"name": "volleyball",
"skills": 3
},
{"name": "kung-fu",
"skills": 2
}
]
}
I am looking at building a query which would answer the question: "how many hobbies with skills=2 do we have?"
The answer for the example above would be 3 ("gardening" is common to both, and each has another unique one).
Every "query" or "query"+"aggs" I tried returns in ['hits']['hits'] or ['aggregations']['sources']['buckets'] the number of matching documents, that is two in the case above (one for "John" and one for "Mary", each of them satisfying the query).
Is there a way to build a query so that it returns the total number of fields (in the example above: the elements of the list "hobby") which matched that query? (fields, not documents)
Note: If my documents were flat:
{"who": "John", "name": "gardening", "skills": 2},
{"who": "John", "name": "sleeping", "skills": 3},
(...)
{"who": "Mary", "name": "kung-fu", "skills": 2}
then a simple "query" matching "skills": 2 plus an aggregation on "name" would have done the job.
Yes, you can achieve this with the nested type and using inner_hits and/or nested aggregations.
So here is the mapping you should use:
curl -XPUT localhost:9200/hobbies -d '{
"mappings": {
"hob": {
"properties": {
"who": {
"type": "string"
},
"hobby": {
"type": "nested", <--- the hobby list is of type nested
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"skills": {
"type": "integer"
}
}
}
}
}
}
}'
Then we can insert your two sample documents using the _bulk endpoint like this:
curl -XPOST localhost:9200/hobbies/hob/_bulk -d '
{"index":{}}
{"who":"John", "hobby":[{"name": "gardening","skills": 2},{"name": "sleeping","skills": 3},{"name": "darts","skills": 2}]}
{"index":{}}
{"who":"Mary", "hobby":[{"name": "gardening","skills": 2},{"name": "volleyball","skills": 3},{"name": "kung-fu","skills": 2}]}
'
And finally, we can query your index for how many hobbies have skills: 2 like this:
curl -XPOST localhost:9200/hobbies/hob/_search -d '{
"_source": false,
"query": {
"nested": {
"path": "hobby",
"query": {
"term": {
"hobby.skills": 2
}
},
"inner_hits": {} <---- this will return only the matching nested fields with skills=2
}
},
"aggs": {
"hobbies": {
"nested": {
"path": "hobby"
},
"aggs": {
"skills": {
"filter": {
"term": {
"hobby.skills": 2
}
},
"aggs": {
"by_field": { <--- this will return a breakdown of the fields with skills=2
"terms": {
"field": "hobby.name"
}
}
}
}
}
}
}
}'
What this query returns is:
In the hits part (via inner_hits), the four hobby entries that have skills: 2.
In the aggs part, a breakdown of the 3 distinct hobby names that have skills: 2.
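To make the two numbers concrete, the same counting can be reproduced locally on the two sample documents. This is a plain-Python sketch of what the nested filter and terms aggregation compute, not a replacement for the query; count_hobbies_with_skill is a hypothetical helper.

```python
from collections import Counter

# The two sample documents from the question.
people = [
    {"who": "John", "hobby": [
        {"name": "gardening", "skills": 2},
        {"name": "sleeping", "skills": 3},
        {"name": "darts", "skills": 2},
    ]},
    {"who": "Mary", "hobby": [
        {"name": "gardening", "skills": 2},
        {"name": "volleyball", "skills": 3},
        {"name": "kung-fu", "skills": 2},
    ]},
]


def count_hobbies_with_skill(docs, skill):
    """Count matching nested hobby entries (not documents) and group them by name."""
    names = [
        hobby["name"]
        for doc in docs
        for hobby in doc["hobby"]
        if hobby["skills"] == skill
    ]
    return len(names), Counter(names)


if __name__ == "__main__":
    total, by_name = count_hobbies_with_skill(people, 2)
    print(total)         # number of matching hobby entries
    print(len(by_name))  # number of distinct hobby names
    print(dict(by_name))
```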

Nested filtering in elasticsearch with more than one term of the same nested type

I'm new to Elasticsearch, so maybe my approach is plain wrong, but I want to build an index of recipes and let the user narrow it down using the aggregated ingredients that are still found in the current subset.
Maybe I'm using the wrong terms to explain this, so an example may clarify. I would like to search for recipes with the term salt, which results in three recipes:
a. with ingredients: salt, flour, water
b. with ingredients: salt, pepper, egg
c. with ingredients: water, flour, egg, salt
The aggregation on the results' ingredients returns salt, flour, water, pepper, egg. When I filter with flour, I only want recipes a and c to appear in the search results (and the aggregation on ingredients should only return salt, flour, water and egg). When I add another filter, egg, I want only recipe c to appear (and the aggregation should only return water, flour, egg, salt).
I can't make the latter work: one filter next to the default query does narrow down the results as desired, but when I add the other term (egg) to the terms filter, the results start to include b again, as if it were an OR filter. Setting the filter execution to AND, however, yields NO results. What am I doing wrong?
My mapping:
{
"recipe": {
"properties": {
"title": {
"analyzer": "dutch",
"type": "string"
},
"ingredients": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"analyzer": "dutch",
"include_in_parent": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
My query:
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"match": {
"_all": "salt"
}
}
]
}
},
"filter": {
"nested": {
"path": "ingredients",
"filter": {
"terms": {
"ingredients.name": [
"flour",
"egg"
],
"execution": "and"
}
}
}
}
}
},
"size": 50,
"aggregations": {
"ingredients": {
"nested": {
"path": "ingredients"
},
"aggregations": {
"count": {
"terms": {
"field": "ingredients.name.raw"
}
}
}
}
}
}
Why are you using a nested mapping here? Its main purpose is to keep relations between the sub-object attributes, but your ingredients field has just one attribute and can be modeled simply as a string field.
So, if you update your mapping like this:
POST recipes
{
"mappings": {
"recipe": {
"properties": {
"title": {
"type": "string"
},
"ingredients": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
You can still index your recipes as:
{
"title":"recipe b",
"ingredients":["salt","pepper","egg"]
}
And this query gives you the result you are expecting:
POST recipes/recipe/_search
{
"query": {
"filtered": {
"query": {
"match": {
"_all": "salt"
}
},
"filter": {
"terms": {
"ingredients": [
"flour",
"egg"
],
"execution": "and"
}
}
}
},
"size": 50,
"aggregations": {
"ingredients": {
"terms": {
"field": "ingredients"
}
}
}
}
which is:
{
...
"hits": {
"total": 1,
"max_score": 0.22295055,
"hits": [
{
"_index": "recipes",
"_type": "recipe",
"_id": "PP195TTsSOy-5OweArNsvA",
"_score": 0.22295055,
"_source": {
"title": "recipe c",
"ingredients": [
"salt",
"flour",
"egg",
"water"
]
}
}
]
},
"aggregations": {
"ingredients": {
"buckets": [
{
"key": "egg",
"doc_count": 1
},
{
"key": "flour",
"doc_count": 1
},
{
"key": "salt",
"doc_count": 1
},
{
"key": "water",
"doc_count": 1
}
]
}
}
}
Hope this helps.
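As I understand it, the original nested filter returned nothing with AND because each nested ingredient object is evaluated on its own, and a single object has only one name, so it can never be both flour and egg. The sketch below models that difference in plain Python on the three recipes from the question; the function names are hypothetical and nothing here calls Elasticsearch.

```python
# The three recipes from the question, keyed by their labels.
recipes = {
    "a": ["salt", "flour", "water"],
    "b": ["salt", "pepper", "egg"],
    "c": ["water", "flour", "egg", "salt"],
}


def nested_and(ingredients, wanted):
    """AND evaluated per nested object: one ingredient name must equal every wanted term."""
    return any(all(name == term for term in wanted) for name in ingredients)


def flat_and(ingredients, wanted):
    """AND evaluated on the flat array: the recipe must contain every wanted term."""
    return set(wanted) <= set(ingredients)


def filter_recipes(predicate, wanted):
    """Return matching recipe labels and the ingredients still present in them."""
    matching = [label for label, ingr in recipes.items() if predicate(ingr, wanted)]
    remaining = sorted({name for label in matching for name in recipes[label]})
    return matching, remaining


if __name__ == "__main__":
    print(filter_recipes(nested_and, ["flour", "egg"]))  # no recipe can match
    print(filter_recipes(flat_and, ["flour"]))
    print(filter_recipes(flat_and, ["flour", "egg"]))
```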

Resources