In Elasticsearch, is there some method to reduce the importance of a set of search terms?

Ideally, I would like to reduce the importance of certain words such as "store", "shop", "restaurant".
I would like "Jimmy's Steak Restaurant" to be about as important as "Ralph's Steak House" when a user searches for "Steak Restaurant". I hope to accomplish this by severely diminishing the importance of the word "restaurant" (along with 20-50 other words).
Stop words work well for some words, such as "a", "the", "of", etc., but they are all-or-nothing.
Is there a way to provide a weighting or boost value per word at the index or mapping level?
I can probably accomplish this at the query level, but that could be very bad if I have 50 words whose impact I need to reduce.
This was a generalized example. In my actual complex solution, I really do need to reduce the impact of quite a few search terms.

I don't believe it is possible to specify a term-level boost during indexing. In this thread, Shay mentions that it is possible in Lucene, but that it's a tricky feature to surface through the API.
Another relevant thread suggests the same thing; Shay recommends trying to sort it out using a custom_score query:
I think that you should first try and solve it on the search side. If you know the weights when you do a search, you can either construct a query that applies different boosts depending on the tag, or use a custom_score query.
The custom_score query is slower than other queries, but I suggest you run it and check if it's OK for you (with actual data and a relevant index size). The good thing is that if it's slow for you (and slow here means both latency and QPS under load), you can always add more replicas and more machines to separate the load.
Here is an example of a custom_score query that boosts on a somewhat-similar term level (except it's for a special field that only has one category term, so this may not apply). It might be easier to break the script out into a native script, instead of using mvel, since you'll have a big list of words.
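Here is a minimal sketch of the first suggestion (different boosts per term at query time), using Python and the requests library against a hypothetical venues index with a name field; the word list and boost values are illustrative:

import requests

# Query-side down-weighting: each generic word gets a small boost,
# every other word keeps the default weight of 1.0.
# Index name, field name, word list and boost values are illustrative.
GENERIC_WORDS = {"store", "shop", "restaurant", "house"}

def build_query(user_input):
    should = []
    for word in user_input.lower().split():
        boost = 0.1 if word in GENERIC_WORDS else 1.0
        should.append({"match": {"name": {"query": word, "boost": boost}}})
    return {"query": {"bool": {"should": should}}}

resp = requests.post(
    "http://localhost:9200/venues/_search",
    json=build_query("Steak Restaurant"),
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])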
As an alternative, perhaps add a synonym token filter that interchanges words like "shop", "restaurant", "store", etc?
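And a sketch of the synonym route: index settings that collapse those generic words into a single token at analysis time (the index name, analyzer name and word list are again illustrative):

import requests

# Collapse generic "venue" words into one token at analysis time so that
# none of them dominates scoring. Index name, analyzer name and the word
# list are illustrative.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "venue_words": {
                    "type": "synonym",
                    "synonyms": ["restaurant, store, shop, house => venue"],
                }
            },
            "analyzer": {
                "venue_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "venue_words"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "venue_analyzer"}
        }
    },
}

requests.put("http://localhost:9200/venues", json=settings)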

Related

Get the most frequent terms of a text field

How do I get a list of all individual tokens of a text field along with their document frequency? I want this to build a domain-specific list of frequent (and therefore useless) stop words.
This question covers all the methods I found so far, but:
the "keyword" data type is not an option, because I'm interested in individual terms (so tokenisation is necessary)
the "significant terms" aggregation is not an option, because I'm interested in the most frequent, not the most significant, terms
the "termvectors" API is not an option, because I need it for the whole index, not just a particular document or a small subset.
You will have to enable fielddata on your field to do this.
But be careful: it can significantly increase heap memory usage.
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
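A minimal sketch, assuming an index my_index with a text field body (both names are illustrative):

import requests

# 1) Enable fielddata on the existing text field (watch heap usage).
requests.put(
    "http://localhost:9200/my_index/_mapping",
    json={"properties": {"body": {"type": "text", "fielddata": True}}},
)

# 2) Run a terms aggregation to get the most frequent tokens in the index.
resp = requests.post(
    "http://localhost:9200/my_index/_search",
    json={
        "size": 0,
        "aggs": {"frequent_terms": {"terms": {"field": "body", "size": 100}}},
    },
)
for bucket in resp.json()["aggregations"]["frequent_terms"]["buckets"]:
    print(bucket["doc_count"], bucket["key"])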

Partial and Full Phrase Match

Say I have the sentence: "John likes to take his pet lamb in his Lamborghini Huracan more than in his Lamborghini Gallardo" and I have a dictionary containing "Lamborghini", "Lamborghini Gallardo" and "Lamborghini Huracan". What's a good way of extracting the matching terms, so that "Lamborghini Gallardo" and "Lamborghini Huracan" come out as phrase matches and "Lamborghini" and "lamb" as partial matches, giving preference to the phrase matches over individual keywords?
Elasticsearch provides exact term matching, match_phrase, and partial matching. An exact term match obviously would not work here, and neither would match_phrase, since the whole sentence is treated as the phrase in this case. I believe partial matching would be appropriate if I only had the keywords of interest in the sentence. Going through previous SO threads, I found proximity for relevance, which seems relevant, although I'm not sure it's the best option since it requires setting a threshold. Or are there simpler/better alternatives to Elasticsearch (which seems aimed more at full-text search than at simple keyword matching against a database)?
It sounds like you'd like to perform keyphrase extraction from your documents using a controlled vocabulary (your dictionary of industry terms and phrases).
[Italicized terms above to help you find related answers on SO and Google]
This level of analysis takes you a bit out of the search stack and into the natural-language processing stack. Since NLP tends to be resource-intensive, it usually takes place offline or, in the case of search applications, at index time.
To implement this, you'd:
Integrate a keyphrase extraction tool into your search-indexing code to generate a list of recognized key phrases for each document.
Index those key phrases as shingles into a new Elasticsearch field.
Include this shingled keyphrase field in the list of fields searched at query time, most likely with a score boost.
For a quick-win tool to help you with controlled keyphrase extraction, check out KEA (written in Java).
(You could also probably write your own, but if you're hoping to extract uncontrolled key phrases (ones not in your dictionary) as well, a trained extractor will serve you better. More tools here.)
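If it helps, here's a rough sketch of steps 2 and 3, assuming the extractor has already produced a list of key phrases per document; the index name, field names and analyzer settings are illustrative:

import requests

# Index with a dedicated "keyphrases" field that is analyzed into shingles
# (word n-grams) alongside the ordinary "body" field. All names here are
# illustrative.
requests.put(
    "http://localhost:9200/docs",
    json={
        "settings": {
            "analysis": {
                "filter": {
                    "phrase_shingles": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 3,
                    }
                },
                "analyzer": {
                    "shingle_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "phrase_shingles"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "body": {"type": "text"},
                "keyphrases": {"type": "text", "analyzer": "shingle_analyzer"},
            }
        },
    },
)

# At index time, store whatever the keyphrase extractor recognized.
requests.post(
    "http://localhost:9200/docs/_doc",
    json={
        "body": "John likes to take his pet lamb in his Lamborghini Huracan ...",
        "keyphrases": ["Lamborghini Huracan", "Lamborghini Gallardo"],
    },
)

# At query time, search both fields and boost the keyphrase field.
resp = requests.post(
    "http://localhost:9200/docs/_search",
    json={"query": {"multi_match": {
        "query": "Lamborghini Huracan",
        "fields": ["body", "keyphrases^3"],
    }}},
)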

Which would be better in Lucene: many queries or one massive OR query?

Problem: I have a large list of keywords, and I want to see whether they are contained in a document or documents. (My users want to know, when a document is published, whether it contains any of their saved keywords.)
So I could make many queries; one for each keyword.
Or I could construct a query something like: "coffee OR tea OR milk OR sugar OR beer"
Now let's say there are over 1,000 keywords.
Which one is likely to lead to pain and suffering?
Would one be better over the other when running against one document or many documents?
(I am leaning towards the OR version, but I am worried I will hit some query length (performance) limit if I go too far.)
Once I have enough data I will run some comparisons and report back.
Any hints between now and then would be great though.
Single Giant Query Pro: You get ranking by Lucene's scoring algorithm for all of the keywords.
Single Giant Query Con: You make Lucene use a huge amount of memory, as it needs to remember each subquery's result (or part of it) in order to give you that nice ranking that takes all keywords into account. The bigger the OR query, the more memory Lucene needs and the slower it runs.
I'd say, if at all possible for your purposes, break it up, since giant OR queries are The Devil (even though it's sometimes necessary to deal with them); but a benchmark should be better than asking random people for opinions :P
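For illustration, here is roughly what the two shapes look like if you happen to drive Lucene through Elasticsearch (the index, field and keyword list are made up); raw Lucene has the same trade-off:

import requests

# Illustrative comparison of the two shapes, phrased as Elasticsearch
# requests (Lucene under the hood); index/field names and the keyword
# list are made up.
keywords = ["coffee", "tea", "milk", "sugar", "beer"]  # imagine 1,000+

# Shape 1: one giant OR, i.e. a single bool query with one clause per
# keyword. Note that Lucene's BooleanQuery has a maximum clause count
# (historically 1,024 by default), so very large lists need that limit
# raised or the list split into batches.
giant_or = {
    "query": {
        "bool": {
            "should": [{"match": {"body": kw}} for kw in keywords],
            "minimum_should_match": 1,
        }
    }
}
requests.post("http://localhost:9200/docs/_search", json=giant_or)

# Shape 2: many small queries, one per keyword. This tells you exactly
# which keywords matched, at the cost of many round trips.
for kw in keywords:
    requests.post(
        "http://localhost:9200/docs/_search",
        json={"query": {"match": {"body": kw}}},
    )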

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back-end is in Java, and it just so happens that the search engine everyone recommends on here, Lucene, is coded in Java as well. I, however, am shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distances of each term indexed. I feel the approach I want to take (detailed below), would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram-based approach to the problem: read and tokenize each item from the database and save it to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams and use them, along with their corresponding locations, to read in the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating those not within a given length range, performing edit-distance calculations, etc.) on this set of data instead of doing so for the entire dataset.
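For concreteness, here's a toy sketch of what I have in mind; n = 3, and the file layout and sample terms are just illustrative:

import os
from collections import defaultdict

# Toy version of the positional n-gram index described above.
# n = 3; the file layout and sample terms are just illustrative.
N = 3
INDEX_DIR = "ngram_index"

def ngrams(term):
    term = term.lower()
    return [(term[i:i + N], i) for i in range(len(term) - N + 1)]

def build_index(terms):
    buckets = defaultdict(list)
    for term in terms:
        for gram, pos in ngrams(term):
            buckets[f"{gram}_{pos}"].append(term)
    os.makedirs(INDEX_DIR, exist_ok=True)
    for name, entries in buckets.items():
        with open(os.path.join(INDEX_DIR, name + ".txt"), "w") as f:
            f.write("\n".join(entries))

def candidates(query):
    # Union of every bucket the query's n-grams point at; length filtering
    # and edit-distance calculations then run only on this reduced set.
    found = set()
    for gram, pos in ngrams(query):
        path = os.path.join(INDEX_DIR, f"{gram}_{pos}.txt")
        if os.path.exists(path):
            with open(path) as f:
                found.update(f.read().splitlines())
    return found

build_index(["bear", "beau", "beacon", "beautiful", "beats by dre"])
print(candidates("beakon"))  # shares the bea_0 bucket with all five terms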
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x fuzzy query used to evaluate the Levenshtein distance between the queried term and every index term (brute-force approach). Given that this approach is rather inefficient, Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with similar n-grams to the queried term and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an alpha preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
For the record, since you are dealing with an English corpus, Lucene (or Solr, though I guess you could use them from vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
The Lucene 4.0 alpha was just released, and many things are easier to customize now, so you could also build upon it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would be hard-pressed to achieve the same performance. Of course, it might be good enough for your case...

Is there a search algorithm/method that matches phrases?

I am trying to make a search tool that would search a small number of objects (about 1000, each with about 3 text fields I want to search) for a given phrase.
I was trying to find an algorithm that would rank the search results for me. Lots of topics lead to Fuzzy matching, and the Levenshtein distance algorithm, but that doesn’t seem appropriate for this case (for example, it would say the phrase “cats and dogs” is closer to “cars and cogs” than it is to “dogs and cats”).
Is there an algorithm/method dedicated to matching a search phrase against other blocks of text, and ranking the results according to things like the text being equal, the search phrase being contained, individual words being contained, etc.? I don't even know what is normally appropriate.
I usually code in C#. I am not using a database.
Look at Lucene. It can perform all sorts of text indexing, return ranked results, and do lots of other good stuff. There's an implementation in C#. It might be a bit of overkill for your use case, but it's such an excellent and useful technology that you should really have a look into it; it's almost certain you will find a good use for it during your career.
