elasticsearch phrase_prefix search - elasticsearch

I'm using query_string for full-text search, with the type parameter controlling how the search behaves. One of the types I have to use is phrase_prefix, to return documents that contain the exact term...
Here is my problem:
When I search for a one-word term such as tea, most of the returned documents match because of teacher. I know that to resolve this I have to use the phrase type, but when I use that type for one-word terms I run into another issue, for example with ui.
Because most of the documents contain the word UI/UX, a phrase search does not return them.
So I need a query that behaves like phrase_prefix, but not all the time, and the problem is that I don't know exactly when!
If anyone has a solution for my problem, I'd be very thankful if you shared it with me.

One simple way to solve this problem would be to use both queries, combined with OR in the should clause of a bool query. You would then retrieve results for both cases. This means that a search for tea would still return both teacher and tea, but since documents containing tea match both clauses, they should get a higher score. The same would work for UI.
Example of the query would be something like:
{
"query": {
"bool" : {
"should" : [
{
##query1
},
{
##query2
}
]
}
}
}
That of course isn't ideal, but at least it would get you going.
More information about should clauses - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
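As a minimal sketch of the idea above, the two placeholder clauses could be filled with a phrase and a phrase_prefix variant of the same query_string query. The helper name build_combined_query and the field names title and body are assumptions made for illustration; they do not come from the question.

```python
import json


def build_combined_query(text, fields):
    """Build a bool/should query pairing a phrase and a phrase_prefix clause.

    `fields` is a hypothetical list of field names; adjust to your mapping.
    """
    def clause(query_type):
        # The same query_string search, differing only in its "type".
        return {
            "query_string": {
                "query": text,
                "fields": fields,
                "type": query_type,
            }
        }

    # Documents matching both clauses (the exact word) should score higher
    # than documents matching only the prefix clause.
    return {
        "query": {
            "bool": {
                "should": [clause("phrase"), clause("phrase_prefix")]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_combined_query("tea", ["title", "body"]), indent=2))
```

The resulting dictionary can be serialized and sent as the body of a _search request; how much higher the exact matches rank depends on your data and analyzers.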

Related

Elasticsearch: how to find out if a value matches any value in a list?

I just started learning Elasticsearch. My data has the company name and its website, and I have a list which contains all the domain aliases of a company. I am trying to write a query which boosts records whose website is in the list.
My data looks like:
{"company_name": "Kaiser Permanente",
"website": "http://www.kaiserpermanente.org"},
{"company_name": "Kaiser Permanente - Urgent Care",
"website": "http://kp.org"}.
The list of domain aliases is:
["kaiserpermanente.org","kp.org","kpcomedicare.org", "kp.com"]
The actual list is longer than the above example. I've tried this query:
{
"bool": {
"should": {
"terms": {
"website": [
"kaiserpermanente.org",
"kp.org",
"kpcomedicare.org",
"kp.com"
],
"boost": 20
}
}
}
}
The query doesn't return anything because the "terms" query requires an exact match. The domains in the list and the URLs are similar but not the same.
What I expect is that the query returns the two records in my example. I think "match" can work, but I couldn't figure out how to match a value with any similar value in the list.
I found a similar question How to do multiple "match" or "match_phrase" values in ElasticSearch. The solution works but my alias list contains more than 50 elements. It would be very verbose if I wrote multiple "match_phrase" for each element. Is there a more efficient way like "terms" so that I could just pass in a list?
I'd appreciate if anyone can help me out with this, thanks!
What you are observing has been covered in many stackoverflow posts / ES docs - the difference between terms and match. When you store that info, I assume you are using the standard analyzer. This means when you push "http://kp.org", Elasticsearch indexes [ "http", "kp", "org" ] tokens broken out. However, when you use terms, it looks for "kp.org" but there was no such "kp.org" token to find matches for since that was broken down by the analyzer when indexing. match, however, will break down what you query for so that "kp.org" => [ "kp", "org" ] and it is able to find one or both. Phrase matching just requires the tokens to be next to each other which is probably necessary for what you need.
Unfortunately, there does not appear to be such an option that works like match but allows many values to match against like terms. I believe you have three options:
1. Programmatically generate the query as described in the Stack Overflow post that you referenced, which you noted would be verbose, but I think this might be just ok unless you have 1k aliases.
2. Analyze the website field so that analysis transforms "http://www.kaiserpermanente.org" => "kaiserpermanente.org" and "http://kp.org" => "kp.org" for indexing. With this index time analysis approach, when querying, you can successfully use the terms filter. This might be fine given urls are structured and the use cases you outline only appear to be concerned with domains. If you do this, use multi fields to analyze one website value in multiple ways. It's nice to have Elasticsearch do this kind of work for you and not worry about it in your own code.
3. Do this processing beforehand (before pushing data to ES) so that when you store data in Elasticsearch, you store not only the website field, but also a domain, paths, and whatever else you need that you calculated beforehand. You get control at the cost of effort you have to put in.
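The first and third options can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names build_alias_query and extract_domain are made up here, and the boost of 20 is simply carried over from the question.

```python
from urllib.parse import urlparse


def build_alias_query(aliases, field="website", boost=20):
    """Option 1: generate one match_phrase clause per alias.

    The clauses are combined in a bool/should, so a document matching
    any alias gets the boost.
    """
    clauses = [
        {"match_phrase": {field: {"query": alias, "boost": boost}}}
        for alias in aliases
    ]
    return {"query": {"bool": {"should": clauses}}}


def extract_domain(url):
    """Option 3: compute a bare domain before indexing.

    Strips the scheme and a leading "www." so that the value could be
    stored in a separate (hypothetical) domain field and queried with terms.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


if __name__ == "__main__":
    aliases = ["kaiserpermanente.org", "kp.org", "kpcomedicare.org", "kp.com"]
    print(build_alias_query(aliases))
    print(extract_domain("http://www.kaiserpermanente.org"))
```

With 50 aliases the generated query is long but is built by a loop, so the verbosity never shows up in your own code.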

Elasticsearch: optimizing query with filter and constant_score?

In a Udemy tutorial I came across this query here:
{ "query": { "bool": {
"must": {"match": {"genre": "Sci-Fi"}},
"must_not": {"match": {"title": "trek"}},
"filter": {"range": {"year": {"gte": 2010, "lt": 2015}}}
}}}
I was wondering if it's possible to optimize it? I am thinking of two possible ways:
1. Putting "genre" in a filter context. But a movie might be of multiple genres, so I am not sure if working with type keyword and filter-term would work there.
2. Putting "must_not" in a filter context directly (without a bool) will not work, because filters as far as I understand do not allow "filtering out", only "filtering what to keep". But if I wrapped must_not in a constant_score or filter-bool, would the query be more performant? Or does ES automatically take care of such optimizations? I just don't understand why must_not is in the query and not filter context in the first place. Can something only partially not match and thus reduce the score only by a degree?
Regarding 1:
Moving the genre match to the filter context might speed it up a little bit (even though that depends on so many other factors), but you'll lose the ranking, which might or might not be important to you. In the end, use must when ranking is important or filter if it's not and your only goal is to match a document or not given some criteria.
Moreover, using type keyword will only get you "exact match" semantics, it might be what you want... or not, depending on how you're creating the queries (user input or controlled pick list)...
Regarding 2:
must_not is already in a filter context, so it doesn't get any simpler than what you already see. The filter context is made up of both filter + must_not.
One last thing I would add and I always add when someone asks about performance optimization: Premature optimization is the root of all evil so only do it when you are actually witnessing performance issues, never before.
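For reference, the variant discussed under "Regarding 1" might look like the sketch below. It assumes that genre is mapped as a keyword field (an assumption, the tutorial's mapping is not shown) and that ranking by genre relevance is not needed. The function name is made up for illustration.

```python
def build_filter_only_query():
    """Same search as the tutorial query, with genre moved to filter context.

    Assumes "genre" is a keyword field. A term query on a multi-valued
    keyword field matches if any one of the values equals "Sci-Fi",
    so movies with several genres are still found.
    """
    return {
        "query": {
            "bool": {
                # must_not already runs in filter context (no scoring).
                "must_not": {"match": {"title": "trek"}},
                "filter": [
                    {"term": {"genre": "Sci-Fi"}},
                    {"range": {"year": {"gte": 2010, "lt": 2015}}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(build_filter_only_query())
```

Because no clause contributes to scoring any more, matching documents come back without a meaningful relevance order, which is the trade-off described above.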

ElasticSearch random score combined with boost?

I am building an iOS app with Firebase, and using ElasticSearch as a search engine to get more advanced queries.
I am trying to achieve a system where I can get a random record from the index, based on a query. I have already got this working using the "random_score" function with a seed.
So right now all documents should have an equal chance of being selected. Is it possible to add a boost or something (sorry, I am new to ES)?
Let's say the document has the field "boost_enabled" and it is set to true; the document should then be 3 times more likely to be selected, "increasing" its chance of being picked as the random record.
So in theory it should look like this:
Documents that match the query:
"document1"
"document2"
"document3"
They all have an equal chance of being selected (33%)
What I wish to achieve is if "document1" has the field "boost_enabled" = true
It should look like this:
"document1"
"document1"
"document1"
"document2"
"document3"
So now "document1" is 3 times more likely to be selected as the random record.
Would really appreciate some help.
EDIT:
I've come up with something like this, is this correct or not? I am pretty sure it's not though...
"query" : {
"function_score": {
"query": {
"bool" : {
"must": {
"match_all": {}
},
"should": [
{ "exists" : {
"field" : "boost_enabled",
"boost" : 3
}
}
],
"filter" : filterArray
}
},
"functions": [
{
"random_score": {"seed": seed}
}
]
}
}
/ Mads
Yes, Elasticsearch has something like that - refer to Elasticsearch: Query-Time Boosting.
In your case, you would have a portion of your query that notes the presence of the flag you described and this "subquery" would have a boost. bool with its should clause will probably be useful.
NB: This is not EXACTLY like being able to say matching document is n times as likely to be a result
EDITS:
--
EDIT 1:
Elasticsearch will tell you how it comes up with the score via the Explain API, which might be helpful in tweaking parameters.
--
EDIT 2:
I apologize for what I had posted above. Upon further thought and exploration, I think the boost parameter is not quite what is required here. function_score already has the notion of weight but even that falls short. I have found other users with requirements similar to yours but it looks like there haven't been any good solutions proposed for this.
References:
Elasticsearch Github Issue on Weighted Random Sampling
Stackoverflow Post with a Request Identical to Github Issue
I do not think the solutions proposed in those posts are quite right. I put together a quick shell script hitting the Elasticsearch REST API and relying on jq (a popular CLI for processing JSON) to demonstrate: Github Gist: Flawed Attempt At Weighed Random Sampling with Elasticsearch
In the script, featured_flag is equivalent to your boost_enabled, and undesired_flag is there to demonstrate how to only consider a subset of documents in the index. To try it out, copy the script and tweak the global variables at the top (Elasticsearch server, index, etc.).
Some notes on the script:
script creates one document with featured_flag enabled and one document with undesired_flag enabled that should not be ever chosen
TOTAL_DOCUMENTS can be used to adjust how many total documents are created (including the first two created)
FEATURED_FLAG_WEIGHT is the weight applied at query time via function_score
script reruns the same query 1000 times and outputs stats on how many times each of the created documents was returned as the first result
I would imagine your index has many "featured" or "boosted" samples among many that are not. With the described requirements, the probability of choosing a sample depends on the weight of the document (let's say 3 for boosted documents, 1 for the rest) and the sum of weights across all valid documents that you want taken into consideration. Therefore, it seems like simple weights, boosts, and randoms are just insufficient.
A lot of people have considered and posted solutions for the task of weighted random sampling without Elasticsearch. This appears to be a good stab at explaining a few approaches: electric monk: Weighted Random Distribution. A lot of algorithmic details may not be quite relevant here but I thought they were interesting.
I think the ideal solution would require work to be done outside of Elasticsearch (without delving into creating Elasticsearch plugins, scorers, etc). Here is the best that I can come up with at the moment:
1. A numeric weight field stored in the documents (can continue with boolean fields but this seems more flexible)
2. Hit Elasticsearch with an initial query leveraging aggregations for some stats we need
   - Possibly a sum aggregation for the sum of weights required for document probabilities
   - A terms aggregation to get counts of documents by weights (ex: m documents with weight 1, n documents with weight 3)
3. Outside of Elasticsearch (in the app), choose the sample
   - generate a random number within the range of 0 to sum_of_weights-1
   - use the aggregation results and the generated random to select an index (see the algorithmic solutions for weighted random sampling outside of Elasticsearch) that is in the range of 0 to total_valid_documents-1 (call this selected_index)
4. Hit Elasticsearch a second time with appropriate filters for considering only valid documents, a sort parameter that guarantees the document set is ordered the same way each time we run this process (perhaps sorted by weight and by document id), and a from parameter set to the selected_index
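The selection step outside of Elasticsearch could look roughly like the sketch below. It assumes the terms aggregation has already been turned into a list of (weight, count) pairs and that the second query sorts by weight in that same order (heaviest first here). The function names are made up for illustration.

```python
import random


def select_index(weight_counts, r):
    """Map r in [0, sum_of_weights - 1] to a document index.

    weight_counts: list of (weight, count) pairs, in the same order the
    second Elasticsearch query sorts the documents by weight.
    Each document with weight w "owns" w consecutive values of r.
    """
    if r < 0:
        raise ValueError("r must not be negative")
    offset = 0  # number of documents in the groups already skipped
    for weight, count in weight_counts:
        span = weight * count  # values of r owned by this group
        if r < span:
            return offset + r // weight
        r -= span
        offset += count
    raise ValueError("r is outside the range 0..sum_of_weights-1")


def pick_random_index(weight_counts, rng=random):
    """Draw r uniformly and return the selected_index for the from parameter."""
    total = sum(weight * count for weight, count in weight_counts)
    return select_index(weight_counts, rng.randrange(total))


if __name__ == "__main__":
    # One boosted document (weight 3) and two regular ones (weight 1).
    counts = [(3, 1), (1, 2)]
    print([select_index(counts, r) for r in range(5)])
    print(pick_random_index(counts))
```

With one document of weight 3 and two of weight 1, three of the five possible random values land on the first document, which mirrors the document1/document1/document1/document2/document3 picture from the question.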
Slightly related to all this, I posted a slightly different write up.
In ES 7.15 I use the script_score query; you can see my example below.
The script "source": "_score + Math.random()" adds a random number between 0.0 and 1.0 to my native boosted score. For more information you can see this
{
"size": { YOUR_SIZE_LIMIT },
"query": {
"script_score": {
"query": {
"query_string": {
"fields": [
"{ YOUR_FIELD_1 }^6",
"{ YOUR_FIELD_2 }^3"
],
"query": "{ YOUR_SEARCH_QUERY }"
}
},
"script": {
"source": "_score + Math.random()"
}
}
}
}

How to deal with too many irrelevant results for a elasticsearch query?

I'm trying to implement a product search for a German e-commerce site and am having great trouble finding the right resources about specific problems.
I had the problem that searches for a partial word wouldn't return viable results, e.g. match for etikett wouldn't result in documents containing Rolletiketten.
Ngrams introduced too many problems, so I got rid of them again after some tests. I found out about word decomposition for the German language and tried some plugins. Now I'm getting far too many completely irrelevant results, e.g. searching for rolletikett returns documents containing möbelrollen, which is something completely different.
While I understand most of the mechanics and why I get those results, I have no clue how to fix my problems and as it seems I'm unable to find the right resources online to clear up some clouds.
A few hints would be awesome. Thank you.
Elasticsearch should give you what you describe out of the box (e.g. with a wildcard search).
Maybe you are doing a boolean query that searches for whole words only.
I suggest the following links that go through the Query Language:
Introductory: http://logz.io/blog/elasticsearch-queries/
In detail: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Hope that helps, Christian
Tip: The document-mapping and exact query you are submitting will be helpful for others to assist in solving your problem.
When you said that introducing ngrams is causing issues, I think you may have ended up placing too much pressure on indexing. Changing the min and max grams value can help in that.
For example, below is an analysis filter that I am using and performs well:
"autocomplete": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "10"
}
Here is another question on Stack Overflow; the problem statement is different, but the solution is relevant for this problem too: https://stackoverflow.com/a/42592722/3133937
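To get a feel for what a filter with these settings emits, here is a small Python emulation. It only illustrates the edge n-gram idea (prefixes of each token between min_gram and max_gram characters) and is not Elasticsearch's implementation; the function name is made up.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


if __name__ == "__main__":
    print(edge_ngrams("rolletiketten"))
    # Edge n-grams are prefixes only, so an inner part of a compound
    # word is not among them.
    print("etikett" in edge_ngrams("rolletiketten"))
```

Since only prefixes are indexed, a search for roll would find Rolletiketten this way, while etikett would not, so the compound-word case from the question needs something beyond edge n-grams.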
"Now I'm getting far too many completely irrelevant results"
Try using min_score: docs
Some of your ES queries may be broad enough that poor quality hits make it into your results. Just setting a threshold for score helps keep them at bay. For me, I had strong score 10 hits, then along with them were a ton of score 0 hits; unwanted. If you see this, I'd guess that your query could be more efficient, but at least min_score keeps the fluff down.
GET /myIndex/_search
{
"from" : 0,
"size" : 10,
"min_score": 1,
"query" : {
"match": {
"Title": {
"query": "Bake a Cake",
"fuzziness": 2
}
}
}
}

Elastic Search Custom char_filter analyzer not working as (I) expected

I have added the following custom analyzer to my elastic search index.
{
"index" : {
"analysis" : {
"char_filter" : {
"acc_mapping" : {
"type" : "mapping",
"mappings" : ["/=>"]
}
},
"analyzer" : {
"accno_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["acc_mapping"]
}
}
}
}
}
Essentially we have a lot of data feeds and legacy data and the account number can vary in format. The current (and majority of data) format is along the lines of acc/vb123/123 and this was causing problems in that it was being tokenised into acc, vb123 and 123. I have a hack solution (in my java app) where if only the account number is searched on I 'hard code' an exact match search in i.e. acc/vb123/123 becomes \"acc/vb123/123\". This is not something I'm proud of but it does the job in most cases (people either tend to search for account number OR text so my hack hasn't become too obvious yet...) I am not an expert in elastic search but I am intermittently trying to put a proper solution in place.
The mapping above has cut down the results from thousands to hundreds, but I was expecting it to cut them down to 1 in most cases. On investigation, what seems to be happening is that searching for acc/vb123/123 no longer picks up other accounts in this format (starting with acc or ending with 123, for example), but it still picks up other formats with a - in them, so for example searching on acc/vb123/123 will also return oldref-123.
Using postman my crude test search is
{
"query" : {
"filtered" : {
"query" : {
"query_string" : {
"query" : "acc/vb123/123",
"fields" : [ "customer.accno" ],
"analyzer": "accno_char_filter"
}
}
}
}
}
the mapping file used to build the index contains
"accno": {
"type": "string",
"analyzer": "accno_char_filter",
"store": true
},
I thought this would mean as far as the index and search were concerned the field stored and searched for would be 1 block of text "accvb123123" and thus would not be tokenised or analyzed into any sub components.
I appear to be wrong as it is picking up accounts with a -, like oldref-123 as I explained. I'm guessing this may be to do with using "tokenizer" : "standard" but I wondered if anyone could explain why this would be happening before I start ripping it apart (or adding a ["-=>"] mapping to see if that helps without breaking anything!)
Thanks in advance.
Edit:
The actual application offers the user one text entry box and uses the content to search 3 fields. One field is account number, one field is name and one field is description. Description can be a long section of very vague text. It can contain account numbers of related accounts (although it usually doesn't). In general the search works perfectly (or close to) out of the box. The only issue is that, the way it is designed, someone may enter an account number in the text box, expecting to see the exact match from the account number field (and any cases where it appears in the description field), or they may enter some general text such as "credit problems Northampton", where they would expect to see results for credit AND problems AND Northampton. With an out-of-the-box setup, entering an account number returns thousands of results, because once tokenised the account number search, say acc/ab12/123, searches for acc, ab12 and 123. I've been slowly and gradually looking for a way to stop acc/ab12/123 being tokenised in any of the 3 fields, so that one could search for acc/ab12/123 and only get 1 result (account number = acc/ab12/123) without removing the possibility of a proper text search.
This is something I have worked on gradually as time allows, as there are plenty of work-arounds (even disregarding the hack I have put in place): the user can place quotes around the search text and search for ["acc/ab12/123"], or even go to "options" and select to only search the account number field. Also, the most relevant matches appear first, so it isn't like thousands of results make the application unusable. From a dev/support perspective this is a non-problem (!), but the business gets upset when they see thousands of results when searching an account number using the default search options, and I would like to learn Elasticsearch customisation better, so I am gradually trying to customise it according to their requirements.
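The expectation described above (the char_filter turning acc/vb123/123 into one block of text) can be illustrated outside Elasticsearch with a rough Python emulation. This is a sketch under assumptions: crude_tokenize is only a stand-in for the standard tokenizer that splits on anything non-alphanumeric, and none of this reproduces how query_string parses the search text.

```python
import re


def apply_mapping_char_filter(text, mappings):
    """Emulate a mapping char_filter: replace each key with its value."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def crude_tokenize(text):
    """Crude stand-in for the standard tokenizer (an assumption):
    split on every character that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


def analyze(text, mappings):
    """Char filter first, then tokenizer, as in the custom analyzer."""
    return crude_tokenize(apply_mapping_char_filter(text, mappings))


if __name__ == "__main__":
    acc_mapping = {"/": ""}  # mirrors "mappings": ["/=>"]
    print(analyze("acc/vb123/123", acc_mapping))
    print(analyze("oldref-123", acc_mapping))
```

Under these assumptions the slash format collapses into a single token, while a value containing - is still split because the hyphen has no mapping. The _analyze API can confirm what your index actually produces for both values.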
