Are there any approaches/suggestiosn for classifying a keyword so the search space will be reduced in elasticsearch? - elasticsearch

I was wondering is there any way to classify a single word before applying a search on elasticsearch. Let's say I have 4 indexes each one holds few millions documents about a specific category.
I'd like to avoid searching the whole search space each time.
This problem becomes more challenging since it's not a sentence,
The search query usually consists only a single or two words, so some nlp magic (Named-entity recognition, POS etc) can't be applied.
I have read few questions on stackoverflow like:
Semantic search with NLP and elasticsearch
Identifying a person's name vs. a dictionary word
and few more, but couldn't find an approach. are there any suggestions I should try?

Related

ElasticSearch / Lucene: automata for the dictionary of terms

Issue: I need to highlight matched terms. Out-of-the-box solution cannot be applied due to the fact we don't keep sources inside ES.
Possible solution:
Retrieve ids from ES by search query
Retrieve sources by ids
Match source with query word by word using LevinsteinDistance algorithm or lucene FSM class
Considering we don't retrieve a lot of content at a time it should not consume a lot of time.
The question is the following:
Does Lucene library contain FSM/automata to represent a dictionary? The desired solution: to get lucene automata representing the dictionary and feed the query to it term by term. Automata should accept terms that are contained in the dictionary. Edit Distance should be considered as well.
Searching for the solution I found lucene classes like LevenshteinAutomata and FuzzyQuery. But LevenshteinAutomata (as I understood) represents only one term. So for several terms I need several automata.

How should I index and search on hyphenated words in English?

I'm using Elasticsearch to search over a fairly broad range of documents, and I'm having trouble finding best practices for dealing with hyphenated words.
In my data, words frequently appear either hyphenated or as compound words, e.g. pre-eclampsia and preeclampsia. At the moment, searching for one won't find the other (the standard tokenizer indexes the hyphenated version as pre eclampsia).
This specific case could easily be fixed by stripping hyphens in a character filter. But often I do want to tokenize on hyphens: searches for jean claude and happy go lucky should match jean-claude and happy-go-lucky.
One approach to solving this is in the application layer, by essentially transforming any query for hyphenated-word into hyphenated-word OR hyphenatedword. But is there any way of dealing with all these use cases within the search engine, e.g. with some analyzer configuration? (Assume that my data is large and varied enough that I can't manually create exhaustive synonym files.)
You can use a compound word token filter - hyphenation_decompounder should probably work decent enough.
It seems like your index consists of many domain specific words that isn't necessarily in a regular English dictionary, so I'd spend some time creating my own dictionary first with the words that are important to your domain. This can be based on domain specific literature, taxonomies, etc. The dictionary_decompounder is suitable for doing stuff like that.
This assumes that your question was relevant to Elasticsearch and not Solr, where the filter is named DictionaryCompoundWordTokenFilter instead.

Partial and Full Phrase Match

Say I have the sentence: "John likes to take his pet lamb in his Lamborghini Huracan more than in his Lamborghini Gallardo" and I have a dictionary containing "Lamborghini", "Lamborghini Gallardo" and "Lamborghini Huracan". What's a good way of extracting the bold terms, achieving the terms "Lamborghini Gallardo" and "Lamborghini Huracan" as phrase matches, and other partial matches "Lamborghini" and "lamb"? Giving preference to the phrase matches over individual keywords.
Elastic search provides exact term match, match phrase, and partial matching. Exact term would obviously not work here, and neither match phrase since the whole sentence is considered as phrase in this case. I believe partial match would be appropriate if I only had the keywords of interest in the sentence. Going through previous SO threads, I found proximity for relevance which seems relevant, although not sure if this is the 'best option' since requires setting a threshold. Or even if there are simpler / better alternatives than elasticsearch (which seems more for full text search rather than simple keyword matching to a database)?
It sounds like you'd like to perform keyphrase extraction from your documents using a controlled vocabulary (your dictionary of industry terms and phrases).
[Italicized terms above to help you find related answers on SO and Google]
This level of analysis takes you a bit out of the search stack into the natural-language processing stack. Since NLP tends to be resource-intensive, it tends to take place offline, or in the case of search-applications, at index-time.
To implement this, you'd:
Integrate a keyphrase extraction tool, into your search-indexing code to generate a list of recognized key phrases for each document.
Index those key phrases as shingles into a new Elasticsearch field.
Include this shingled keyphrase field in the list of fields searched at query-time — most likely with a score boost.
For a quick win tool to help you with controlled keyphrase extraction, check out KEA (written in java).
(You could also probably write your own, but if you're also hoping to extract uncontrolled key phrases (not in dictionary) as well, a trained extractor will serve you better. More tools here.)

Finding keywords in a set of small texts

I have a set of almost 2000 texts.
My goal is to find the keywords across these texts to understand what is the subject of them, or simply the most common words and expressions.
I would like some ideias of algorithms to score the words and identify when they frequently come together.
I have read some other related questions here, but I'm trying to get more and more information about this subject. So any ideas are very welcome. Thank you so much!
--
I have already extracted stopwords. After removing them I have more than 7000 words remaing; My question is how to score these words and from which point I can consider removing some them from my list of keywords. Also, how to get key expressions, find words that come together.
You may want to refer to a classical text on Information Retrieval. Most of the algorithms use a stop list to remove commonly occurring words such as "for" and "the", and then, extract the base or root word (change "seeing", "seen", "see", "sees" to the base word "see"). The remaining words form the keywords of the document and are weighted by things like term frequency (how many times the word occurs in the document) and inverse document frequency (how unique is the word in describing the content). You can use the weighted keywords as document representation and use them for retrieval.
You can use the Lucene MoreLikeThis implementation, which extracts a list of top most important keywords from a given text document. The term scoring function it uses is the tf-idf, i.e. it chooses those terms with topmost tf-idf scores, i.e. the terms which are relatively uncommon and occur frequently in the document.
If efficiency is an issue, it employs some common heuristics as follows.
Since you're trying to maximize a tf*idf score, you're probably most interested in terms with a high tf. Choosing a tf threshold even as low as two or three will radically reduce the number of terms under consideration. Another heuristic is that terms with a high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the number of characters, not selecting anything less than, e.g., six or seven characters. With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms that do a pretty good job of characterizing a document.
More details can be found in this javadoc.

Is there a search algorithm/method that matches phrases?

I am trying to make a search tool that would search a small number of objects (about 1000, each with about 3 text fields I want to search) for a given phrase.
I was trying to find an algorithm that would rank the search results for me. Lots of topics lead to Fuzzy matching, and the Levenshtein distance algorithm, but that doesn’t seem appropriate for this case (for example, it would say the phrase “cats and dogs” is closer to “cars and cogs” than it is to “dogs and cats”).
Is there an algorithm/method dedicated to matching a search phrase against other blocks of text, and ranking the results according to things like the text being equal, the search phrase being contained, individual words being contained etc. I don’t even know what is normally appropriate.
I usually code in c#. I am not using a data base.
Look at Lucene. It can perform all sort of text indexing, return ranked results, and lots of other good stuff. There's an implementation in C#. It might be a bit overkill for your use case, but it's such an excellent and useful technology that you should really have a look into it, it's almost certain you will find good use for it during your career.

Resources