I have at least a thousand documents, with more coming, containing title/page content. Think Google-like site search. I'm experiencing a lot of noise when enabling fuzziness; at the same time, fuzziness is helpful in addressing user error, etc.
I already have the indexing down, consuming all changes to pages in real time.
I'm doing a bunch of conditional steps for matching and boosting results based on significance.
Initially we did a PoC converting the page contents into fixed-length vectors and querying with BERT, which did not bring much improvement, so right now it's using pure Elasticsearch queries.
I know there is a lot that goes into searchability in Elasticsearch. Do you happen to have any resources on improving the site-search experience with Elasticsearch? I'm missing major basics/foundations of searchability which I would like to improve on.
I have gotten a recommendation from another post for https://www.manning.com/books/relevant-search which I plan to learn from when it is delivered.
One thing I'm thinking about, also proposed by my team lead, is to perform dynamic queries based on the query the user makes, e.g. if the user searches for a name (a service checks whether it is a name), use a query without fuzziness; a rough sketch of such a variant follows the current query below.
{
"query": {
"bool": {
"must_not":{
"match": {
"channel": "techhub"
}
},
"should": [
{
"match_phrase": {
"title": {
"query": message,
"slop": 1,
"boost": 10.0
}
}
},
{
"match": {
"title": {
"query": message,
"fuzziness": 1,
"minimum_should_match": "1<30%",
"boost": 5.0
}
}
},
{
"match": {
"title.edge_ngrams": {
"query": message,
"fuzziness": 1,
"minimum_should_match": "1<30%",
"boost": 3.0
}
}
},
{
"match_phrase": {
"plain_blob": {
"query": message,
"slop": 1,
"boost": 10.0
}
}
},
{
"match": {
"plain_blob": {
"query": message,
"fuzziness": 1,
"minimum_should_match": "1<30%",
"boost": 1.5
}
}
},
{
"match": {
"plain_blob.edge_ngrams": {
"query": message,
"fuzziness": 1,
"minimum_should_match": "1<30%",
"boost": 1.0
}
}
}
],
"minimum_should_match": 1
}
},
"size": 10,
"from": 0
}
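One way the name-detection branch above could look is a variant of the same query with the fuzzy and edge-ngram clauses dropped, so only phrase matches are scored. This is only a rough sketch, reusing message, the channel exclusion, and the field names from the query above; the boosts are illustrative:
{
  "query": {
    "bool": {
      "must_not": {
        "match": { "channel": "techhub" }
      },
      "should": [
        { "match_phrase": { "title":      { "query": message, "slop": 1, "boost": 10.0 } } },
        { "match_phrase": { "plain_blob": { "query": message, "slop": 1, "boost": 5.0 } } }
      ],
      "minimum_should_match": 1
    }
  },
  "size": 10,
  "from": 0
}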
Related
I have an Elasticsearch index with the standard analyzer. I would like to perform search queries containing multiple words, e.g. human anatomy. This search should be performed across several fields:
Title
Subject
Description
All the words in the query should be present in any of the fields (e.g. 'human' in title and 'anatomy' in description, etc.). If not all the words are present across these fields, the result shouldn't be returned.
Now, more importantly, I want to get fuzzy matches. For example, these queries should return approximately the same results as human anatomy:
human anatom
human anatomic
humanic anatomic
etc.
So fuzziness should apply to every word in the query.
As Elasticsearch doesn't support fuzziness for the multi-match cross-fields queries, I have been trying to achieve the desired behaviour this way:
{
"query": {
"bool" : {
"must": [
{
"query": {
"bool":
{
"should": [
{
"match": {
"title": {
"query": "human",
"fuzziness": 2,
}
}
},
{
"match": {
"description": {
"query": "human",
"fuzziness": 2,
}
}
},
{
"match": {
"subject": {
"query": "human",
"fuzziness": 2,
}
}
},
]
}
}
},
{
"query": {
"bool":
{
"should": [
{
"match": {
"title": {
"query": "anatomy",
"fuzziness": 2,
}
}
},
{
"match": {
"description": {
"query": "anatomy",
"fuzziness": 2,
}
}
},
{
"match": {
"subject": {
"query": "anatomy",
"fuzziness": 2,
}
}
},
]
}
}
}
]
}
}
}
The idea behind this code is the following: find the results where
either of the fields contains human (with 2-letter edit distance, e.g.: humane, humon, humanic, etc.)
and
either of the fields contains anatomy (with 2-letter edit distance, e.g.: anatom, anatomic, etc.).
Unfortunately, this code does not work and fails to retrieve a great number of relevant results. For example (each word in the two queries below is within an edit distance of 2 of the corresponding word in human anatomy):
human anatomic – 0 results
humans anatomy – 21 results
How can I make fuzziness work within the given conditions? Recreating the index with n-gram is currently not an option, so I would like to make fuzziness work.
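For reference, the same "every term must match in some field" intent can be written more compactly with one multi_match (default best_fields, which does accept fuzziness) per query term. The set of matching documents is the same as with the nested bool/should version above; only the scoring differs (best field instead of a sum). This is just a sketch of that shape, not a fix for the missing-results problem:
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "human",
            "fields": ["title", "subject", "description"],
            "fuzziness": 2
          }
        },
        {
          "multi_match": {
            "query": "anatomy",
            "fields": ["title", "subject", "description"],
            "fuzziness": 2
          }
        }
      ]
    }
  }
}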
I'm writing some code to generate queries, and I wondered whether one way of generating the queries is kinder to the server than another.
So this query:
{
"from": 0,
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"Text": {
"query": "Scooby Shaggy corridor",
"fuzziness": 1,
"operator": "AND"
}
}
}
]
}
}
}
is logically equivalent to this:
{
"from": 0,
"size": 10,
"query": {
"bool": {
"must": [
{
"match": {
"Text": {
"query": "Scooby",
"fuzziness": 1
}
}
},
{
"match": {
"Text": {
"query": "Shaggy",
"fuzziness": 1
}
}
},
{
"match": {
"Text": {
"query": "corridor",
"fuzziness": 1
}
}
}
]
}
}
}
but is either one easier for the server to process?
Or does it make no difference?
I realise this is a trivial example but could it make a difference with more complex queries?
If someone who knows a bit about how Elasticsearch behaves under the hood could make an observation, I'd be grateful.
Thanks,
Adam.
Elasticsearch will itself rewrite your multi-term match query to the logical equivalent; see here for more details.
The match query is of type boolean. It means that the text provided is
analyzed and the analysis process constructs a boolean query from the
provided text.
But you should keep the multi-term match query and let Elasticsearch do the job. It's more maintainable, and you can control the rewriting thanks to the rewrite parameter (see here).
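For a fuzzy match query specifically, the rewrite of the expanded terms is controlled through the fuzzy_rewrite parameter, so the single multi-term form from the question can be kept. A minimal sketch, reusing the Text field and query from the question (the rewrite value shown is just one of the documented options):
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "Text": {
        "query": "Scooby Shaggy corridor",
        "operator": "and",
        "fuzziness": 1,
        "fuzzy_rewrite": "top_terms_blended_freqs_10"
      }
    }
  }
}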
I am using the following multi_match query in Elasticsearch, and I am wondering if I can use fuzziness only for the friendly_name field. I have tried different things but it doesn't seem to work. I am also wondering if it is possible to use an analyzer to get a result similar to what fuzziness does:
"query": {
"multi_match": {
"query": "input query",
"fields": ["code_short", "code_word","friendly_name"],
"minimum_should_match": "2"
} }, "_source": ["code", "friendly_name"]
Any help would be appreciated. Thanks.
If you only need to query one field, you don't need multi_match:
"match": {
"name": {
"query": "your query",
"fuzziness": "1.5",
"prefix_length": 0,
"max_expansions": 100,
"minimum_should_match": "80%"
}
}
I don't believe that you can fully replace fuzziness, but you have two options to explore that might work for you: the ngram filter or the stemmer filter.
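For the stemmer route, the field would need to be indexed with a stemming analyzer; a minimal, hypothetical index-creation sketch (the index and analyzer names are made up, and the mapping uses 7.x-style syntax only for illustration):
PUT /catalog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "friendly_name": { "type": "text", "analyzer": "stemmed" }
    }
  }
}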
======
Well, it wasn't very clear to me what you intended, but you can write your query this way:
"query": {
"bool": {
"should": [
{
"match": {
"friendly_name": {
"query": "text",
"fuzziness": "1.5",
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"match": {
"code_word": {
"query": "text"
}
}
},
{
"match": {
"code_short": {
"query": "text"
}
}
}
],
"minimum_should_match" : 2
}
}
We are using Elasticsearch to search for the most relevant companies in a specific catalog. When we use a normal search term like lettering, we get reasonable scores and can sort the results according to the score.
However, when we modify the search term before querying and turn it into the "starred" version (e.g., *lettering*) so we can search for substrings, we get a score of 1.0 for every result. Searching for substrings is a requirement in the project.
Any ideas on what causes this relevance behavior? The problem occurs only when a single term is used; we get comprehensible scores when we use two starred terms in combination (e.g., *lettering* *digital*).
EDIT 1:
Example mapping (YAML; other properties are mapped in the same way, except for boost, which is different for each property):
elasticSearchMapping:
type: object
include_in_all: true
enabled: true
properties:
'keywords':
type: string
include_in_all: true
boost: 50
Query:
{
"query": {
"filtered": {
"query": {
"bool": {
"must": [{
"match_all": []
}, {
"query_string": {
"query": "*lettering*"
}
}]
}
},
"filter": {
"bool": {
"must": [{
"term": {
"__parentPath": "/sites/industrycatalog"
}
}, {
"terms": {
"__workspace": ["live"]
}
}, {
"term": {
"__dimensionCombinationHash": "d751713988987e9331980363e24189ce"
}
}, {
"term": {
"__typeAndSupertypes": "IndustryCatalog:Entry"
}
}],
"should": [],
"must_not": [{
"term": {
"_hidden": true
}
}, {
"range": {
"_hiddenBeforeDateTime": {
"gt": "now"
}
}
}, {
"range": {
"_hiddenAfterDateTime": {
"lt": "now"
}
}
}]
}
}
}
},
"fields": ["__path"],
"script_fields": {
"distance": {
"script": "doc['coordinates'].distanceInKm(51.75631079999999,14.332867899999997)"
}
},
"sort": [{
"customer.featureFlags.industrycatalog": {
"order": "asc"
}
}, {
"_geo_distance": {
"coordinates": {
"lat": "51.75631079999999",
"lon": "14.332867899999997"
},
"order": "asc",
"unit": "km",
"distance_type": "plane"
}
}],
"size": 999999
}
What you are doing is a wildcard query. Wildcard queries fall under term-level queries, and by default a constant score is applied.
Check the Lucene documentation: WildcardQuery extends MultiTermQuery.
You can also verify this with the help of the explain API; you will see something like this:
"_explanation": {
"value": 1,
"description": "ConstantScore(company:lettering), product of:",
"details": [{
"value": 1,
"description": "boost"
}, {
"value": 1,
"description": "queryNorm"
}]
}
You can change this behavior with rewriting.
Try this; rewrite also works with the query_string query:
{
"query": {
"wildcard": {
"company": {
"value": "digital*",
"rewrite": "scoring_boolean"
}
}
}
}
There are various options for scoring; see what fits your requirement.
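For the query_string form used in the question, the same parameter can be passed directly; a minimal sketch with the example term:
{
  "query": {
    "query_string": {
      "query": "*lettering*",
      "rewrite": "scoring_boolean"
    }
  }
}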
EDIT 1: the reason you see a score other than 1 for *lettering* *digital* is queryNorm. You can again check with the explain API; if you look closely, all documents that match both terms will have the same score, and all documents with a single match will likewise share the same score.
P.S.: a leading wildcard is not recommended at all. You will get performance issues, since it has to check against every single term in the inverted index. You might want to look at the edge_ngram or ngram filter.
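Since the requirement here is substring search, the ngram filter (rather than edge_ngram, which only covers prefixes of each token) is the closer fit. A minimal, hypothetical analysis sketch; the index, filter, and analyzer names are made up, the gram sizes are arbitrary, and on ES 7+ the index.max_ngram_diff setting must allow the min/max spread:
PUT /companies
{
  "settings": {
    "index": {
      "max_ngram_diff": 7
    },
    "analysis": {
      "filter": {
        "substring_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "substring_filter"]
        }
      }
    }
  }
}
At search time the field would typically use a plain (non-ngram) search analyzer so the query terms themselves are not split into grams; max_gram then caps the longest query term that can match.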
Hope this helps!
In the elasticsearch doc for a bool query at this link:
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/query-dsl-bool-query.html
It doesn't say what the containing structure is. If I just use bool the way they have it, it's totally wrong; I need to surround it with some silly combination of query/filter/filtered query. I'm not sure what the correct way to form a JSON query in Elasticsearch is. The documentation seems to be contradictory in many places about what goes where and how. Are there any Elasticsearch experts out there who know how to properly form a query?
First of all, there is a "bool" query and a "bool" filter, and they go in different places and do slightly different things. As a general rule, if you can use a filter, do it (many of them can be cached, and they are a little faster even when not). If you need a "match", then you need a query.
The example on the page you referenced could actually be used either way:
As a query:
POST /test_index/_search
{
"query": {
"bool": {
"must": {
"term": {
"user": "kimchy"
}
},
"must_not": {
"range": {
"age": {
"from": 10,
"to": 20
}
}
},
"should": [
{
"term": {
"tag": "wow"
}
},
{
"term": {
"tag": "elasticsearch"
}
}
],
"minimum_should_match": 1,
"boost": 1
}
}
}
Or as a filter (in a filtered query):
POST /test_index/_search
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": {
"term": {
"user": "kimchy"
}
},
"must_not": {
"range": {
"age": {
"from": 10,
"to": 20
}
}
},
"should": [
{
"term": {
"tag": "wow"
}
},
{
"term": {
"tag": "elasticsearch"
}
}
],
"minimum_should_match": 1
}
}
}
}
}
Also I totally get the frustration with the ES documents. I've been working with them for a couple of years now, and they don't seem to be getting any better. Maybe the people in charge of documentation just don't care all that much. The conspiracy theory view would be that bad documentation helps the company sell professional services.