Partial and Full Phrase Match - elasticsearch

Say I have the sentence: "John likes to take his pet lamb in his Lamborghini Huracan more than in his Lamborghini Gallardo" and I have a dictionary containing "Lamborghini", "Lamborghini Gallardo" and "Lamborghini Huracan". What's a good way of extracting the matching terms, so that "Lamborghini Gallardo" and "Lamborghini Huracan" come back as phrase matches and "Lamborghini" and "lamb" as partial matches, with preference given to the phrase matches over individual keywords?
Elasticsearch provides exact term match, match phrase, and partial matching. Exact term would obviously not work here, and neither would match phrase, since the whole sentence is treated as the phrase in this case. I believe partial match would be appropriate if the sentence contained only the keywords of interest. Going through previous SO threads, I found proximity for relevance, which seems relevant, although I'm not sure it is the 'best option' since it requires setting a threshold. Are there simpler or better alternatives to Elasticsearch (which seems geared more towards full-text search than simple keyword matching against a database)?

It sounds like you'd like to perform keyphrase extraction from your documents using a controlled vocabulary (your dictionary of industry terms and phrases).
[Italicized terms above to help you find related answers on SO and Google]
This level of analysis takes you a bit out of the search stack into the natural-language processing stack. Since NLP tends to be resource-intensive, it tends to take place offline, or in the case of search-applications, at index-time.
To implement this, you'd:
Integrate a keyphrase extraction tool into your search-indexing code to generate a list of recognized key phrases for each document.
Index those key phrases as shingles into a new Elasticsearch field.
Include this shingled keyphrase field in the list of fields searched at query-time — most likely with a score boost.
For a quick-win tool to help you with controlled keyphrase extraction, check out KEA (written in Java).
(You could also probably write your own, but if you're also hoping to extract uncontrolled key phrases (not in dictionary) as well, a trained extractor will serve you better. More tools here.)
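If you do write your own controlled-vocabulary extractor, a minimal sketch could look like the following. It assumes whitespace/word-character tokenization and a greedy longest-match-first scan, which is one simple way to prefer phrases over single keywords; the function name `extract_phrases` is hypothetical, and this is not how KEA works internally.

```python
import re

def extract_phrases(text, vocabulary):
    """Greedy longest-match-first lookup of vocabulary terms in text."""
    # Map each term's lowercase token tuple back to its original spelling.
    vocab = {tuple(term.lower().split()): term for term in vocabulary}
    if not vocab:
        return []
    max_len = max(len(key) for key in vocab)
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase starting at token i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in vocab:
                matches.append(vocab[candidate])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no term starts here, move on
    return matches

sentence = ("John likes to take his pet lamb in his Lamborghini Huracan "
            "more than in his Lamborghini Gallardo")
terms = ["Lamborghini", "Lamborghini Gallardo", "Lamborghini Huracan"]
print(extract_phrases(sentence, terms))
```

Because the longest phrase is tried first, "Lamborghini" on its own is only reported when it is not followed by "Huracan" or "Gallardo". Prefix-style partial matches such as "lamb" are not handled by this sketch.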

Related

Are there any approaches/suggestions for classifying a keyword so the search space will be reduced in elasticsearch?

I was wondering whether there is any way to classify a single word before running a search on Elasticsearch. Let's say I have 4 indexes, each holding a few million documents about a specific category.
I'd like to avoid searching the whole search space each time.
The problem is harder because the input is not a sentence: the search query usually consists of only one or two words, so NLP techniques (named-entity recognition, POS tagging, etc.) can't be applied.
I have read a few questions on Stack Overflow, such as:
Semantic search with NLP and elasticsearch
Identifying a person's name vs. a dictionary word
and a few more, but couldn't find an approach. Are there any suggestions I should try?

How should I index and search on hyphenated words in English?

I'm using Elasticsearch to search over a fairly broad range of documents, and I'm having trouble finding best practices for dealing with hyphenated words.
In my data, words frequently appear either hyphenated or as compound words, e.g. pre-eclampsia and preeclampsia. At the moment, searching for one won't find the other (the standard tokenizer indexes the hyphenated version as pre eclampsia).
This specific case could easily be fixed by stripping hyphens in a character filter. But often I do want to tokenize on hyphens: searches for jean claude and happy go lucky should match jean-claude and happy-go-lucky.
One approach to solving this is in the application layer, by essentially transforming any query for hyphenated-word into hyphenated-word OR hyphenatedword. But is there any way of dealing with all these use cases within the search engine, e.g. with some analyzer configuration? (Assume that my data is large and varied enough that I can't manually create exhaustive synonym files.)
You can use a compound word token filter - hyphenation_decompounder should probably work decently enough.
It seems like your index contains many domain-specific words that aren't necessarily in a regular English dictionary, so I'd spend some time first creating my own dictionary of the words that are important to your domain. This can be based on domain-specific literature, taxonomies, etc. The dictionary_decompounder is suitable for that kind of setup.
This assumes that your question was relevant to Elasticsearch and not Solr, where the filter is named DictionaryCompoundWordTokenFilter instead.
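To illustrate what dictionary-based decompounding does to a token, here is a simplified sketch in plain Python. It is not the Lucene/Elasticsearch implementation; the function name `decompound`, its parameters, and the tiny word list are assumptions for illustration. The idea is that the original token is kept and every dictionary word found inside it is emitted as an extra token, so "preeclampsia" becomes searchable via "pre" and "eclampsia".

```python
def decompound(token, dictionary, min_subword=2, max_subword=15):
    """Return the token plus every dictionary word found inside it."""
    lowered = token.lower()
    parts = [token]  # the original token is always kept
    for start in range(len(lowered)):
        longest_end = min(len(lowered), start + max_subword)
        for end in range(start + min_subword, longest_end + 1):
            sub = lowered[start:end]
            # Emit dictionary subwords, but not the whole token again.
            if sub in dictionary and sub != lowered:
                parts.append(sub)
    return parts

domain_words = {"pre", "eclampsia"}  # hypothetical domain dictionary
print(decompound("preeclampsia", domain_words))
```

With the standard tokenizer already splitting "pre-eclampsia" into "pre" and "eclampsia", both spellings end up sharing those two tokens in the index.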

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web-app I'm currently working on. The back-end is in Java, and it just so happens that the search engine that everyone recommends on here, Lucene, is coded in Java as well. I am, however, shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation evaluates the edit distance to every indexed term. I feel the approach I want to take (detailed below) would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save the items to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams and use them, along with their locations, to read the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating those not within a given length range, performing edit distance calculations, etc.) on this set of data instead of doing so for the entire dataset.
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implementing fuzzy search (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
The Lucene 3.x fuzzy query evaluated the Levenshtein distance between the queried term and every index term (a brute-force approach). Because this is rather inefficient, the Lucene spellchecker relied on something similar to what you describe: Lucene would first search for terms sharing n-grams with the queried term and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an ALPHA preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
For the record, as you are dealing with an English corpus, Lucene (or Solr, though I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
Lucene 4.0 alpha was just released and many things are easier to customize now, so you could also build upon it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would be hard-pressed to match its performance. Of course, your own approach might be good enough for your case...
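For comparison, the positional n-gram filtering described in the question can be sketched in a few lines. This is an in-memory illustration only (a dict instead of one file per n-gram), and all function names are hypothetical; it is neither Lucene's old spellchecker nor its automaton-based FuzzyQuery.

```python
from collections import defaultdict

def ngrams(term, n=3):
    """Yield (n-gram, position) pairs, mirroring the [n-gram]_[location] scheme."""
    return [(term[i:i + n], i) for i in range(len(term) - n + 1)]

def build_index(terms, n=3):
    index = defaultdict(set)
    for term in terms:
        for key in ngrams(term, n):
            index[key].add(term)
    return index

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(query, index, n=3, max_distance=2):
    # Collect only terms sharing at least one positional n-gram with the query.
    candidates = set()
    for key in ngrams(query, n):
        candidates |= index.get(key, set())
    scored = [(levenshtein(query, term), term) for term in candidates]
    return sorted((d, t) for d, t in scored if d <= max_distance)

index = build_index(["bear", "beau", "beacon", "beautiful", "beats by dre"])
print(fuzzy_search("beat", index))
```

One caveat with keying on position: an insertion or deletion near the start of the query shifts every later n-gram, so some close terms never become candidates. Terms shorter than n have no n-grams at all in this sketch.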

Is there a search algorithm/method that matches phrases?

I am trying to make a search tool that would search a small number of objects (about 1000, each with about 3 text fields I want to search) for a given phrase.
I was trying to find an algorithm that would rank the search results for me. Lots of topics lead to Fuzzy matching, and the Levenshtein distance algorithm, but that doesn’t seem appropriate for this case (for example, it would say the phrase “cats and dogs” is closer to “cars and cogs” than it is to “dogs and cats”).
Is there an algorithm/method dedicated to matching a search phrase against other blocks of text, and ranking the results according to things like the text being equal, the search phrase being contained, individual words being contained etc. I don’t even know what is normally appropriate.
I usually code in C#. I am not using a database.
Look at Lucene. It can perform all sorts of text indexing, return ranked results, and lots of other good stuff. There's an implementation in C#. It might be a bit overkill for your use case, but it's such an excellent and useful technology that you should really look into it; you will almost certainly find a good use for it during your career.
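For roughly 1000 small objects, the tiered ranking the question describes (exact equality, then phrase containment, then individual word overlap) can also be sketched by hand. The tier weights and function names below are assumptions chosen for illustration, not a standard algorithm, and the sketch is in Python rather than C#.

```python
def score(phrase, text):
    """Higher is better: 3 = equal, 2 = phrase contained, 0..1 = word overlap."""
    p = phrase.lower().split()
    t = text.lower().split()
    if not p:
        return 0.0
    if p == t:
        return 3.0
    # Look for the phrase as a contiguous run of words in the text.
    if any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1)):
        return 2.0
    # Otherwise fall back to the fraction of query words present in the text.
    query_words = set(p)
    return len(query_words & set(t)) / len(query_words)

def rank(phrase, texts):
    return sorted(texts, key=lambda text: score(phrase, text), reverse=True)

docs = ["cars and cogs", "dogs and cats",
        "I like cats and dogs a lot", "cats and dogs"]
print(rank("cats and dogs", docs))
```

Unlike a raw Levenshtein comparison, this places "dogs and cats" above "cars and cogs" for the query "cats and dogs", because it compares whole words rather than characters.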

Algorithm for computing the relevance of a keyword to a short text (50 - 100 words)

I want to compute the relevance of a keyword to a short description text. What would be the best approach in terms of efficiency and ease of implementation? I am using C++.
Simple solution: Count the occurrences of the word in the text.
Doing a good job, though, is a hard problem that companies like Google have been working on for years. If possible, you might want to take a look at using their technology.
To expand, try the following:
Use a dictionary (e.g. WordNet) to replace all synonyms with a common word
Detect similar words using Levenshtein distance
That's still only going to get you so far. You'll need to perform some natural language processing to truly understand what the description is about to distinguish between multiple texts containing the keyword the same number of times.
Refer to these previous Stack Overflow questions:
What are Useful Ranking Algorithms for Documents without Links (e.g. PDF, MS Documents, etc…)?
Algorithm for generating a 'top list' using word frequency.
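The simple counting solution, combined with the synonym-folding idea, can be sketched as follows. The sketch is in Python rather than C++ for brevity; the `synonyms` mapping is a hypothetical hand-made stand-in for a real resource such as WordNet, and normalizing by text length is an added assumption so that texts of different sizes are comparable.

```python
import re

def relevance(keyword, text, synonyms=None):
    """Fraction of tokens in text that equal the keyword after synonym folding."""
    synonyms = synonyms or {}

    def canonical(word):
        # Replace a word by its common form if the mapping knows it.
        return synonyms.get(word, word)

    tokens = [canonical(w) for w in re.findall(r"\w+", text.lower())]
    if not tokens:
        return 0.0
    target = canonical(keyword.lower())
    return tokens.count(target) / len(tokens)

description = "The car is fast. An automobile is a car."
print(relevance("car", description))
print(relevance("car", description, {"automobile": "car"}))
```

Folding "automobile" into "car" raises the score because the description then counts three mentions instead of two.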

Resources