Related search words in Elasticsearch - elasticsearch

I'm looking for a convenient way to search for related words to a term. For example, If I search for the word "washer", I should be getting related search terms like "dryer" with the lower score thank washer results, It means the washer documents must appear the first and then dryer documents. how can I do this functionality?

You need to build a synonym dictionary. Fortunately, we have machine learning models now, like "word2vec (neural net)", that can do this. You can try using open source gensim package for this.
The input to the model would be lots of text/info/articles that carries the word washer and dryer. once you train on this, you can find closest words that are related to "washer" and use these as synonym like dictionary.
At query time, look up this dictionary and expand the query with lower weight/boost for synonyms than the actual term.

Related

Elasticsearch show what is matched from a query

I'm implementing a sort of "natural language" search assistant. I have a form with a number of select fields. The list of options in each field can be pretty lengthy. So rather than having to select each item individually, I'm adding a text input box where people can just type what they're looking for and the app will suggest possible searches, based on the options in the select dropdowns.
Let's say my options are:
Color: red, blue, black, yellow, green
Size: very small, kinda medium, super large
Shape: round, square, oblong, cylindrical
Year: 2007, 2008, 2009, 2010
If you typed in "2007 very small star-spangled", the text input would suggest "Search all 2007 very small widgets for 'star-spangled'". It understood that "2007" and "very small" were select options in the form, and that "star-spangled" was not, and suggested a search where "2007" and "very small" are selected, and then left the "star-spangled" bit for a plaintext search.
What I'm working on right now is parsing the search query and picking out the bits that fit into the select fields. I have all the options in Elasticsearch. I was thinking of searching each type individually to see if it matches anything in the search query. That seems straightforward to me. I can easily find matches. However, I don't know which part of the query actually matches each type, which I need in order to find out that e.g. "star-spangled" is the part that didn't match options.
So, in the end, I need to know that only the "2007" substring matched the year, only the "very small" substring matched the size, and "star-spangled" didn't match anything.
My first thought is to split the query into word-grams (e.g. "2007", "2007 very", "2007 very small", "2007 very small star-spangled", "very", "very small", "very small star-spangled", "small", "small star-spangled", "star-spangled") and search each option for each gram. Then I would know for sure which gram matched. However, this could obviously get resource intensive pretty quickly. Also, I know Elasticsearch can do that sort of search internally much faster.
So what I really need is to be able to perform a search and, along with the results, get back which part of the original query actually matched. So if I searched, "2007 verr small" (intentional misspelling) and did a fuzzy search of sizes, passing the entire query string, and I get the "Very Small" size back as a result, it would indicate that "verr small" is the part of the query that matched that size.
Any idea of how to do that? Or possibly some other solutions?
I could do the search and parse the results to see which bits match the string. Though I could see that being resource intensive as well. And if I'm doing a fuzzy search, it wouldn't necessarily be clear which part of the query triggered a match in the result.
I was also thinking that highlighting might work for this, but I don't know enough about Elasticsearch to know for sure.
EDIT: I tested this out using highlighting. It's so close to working. The highlight field comes back with the part of the string that matches. However, it only shows the part of the result that matches. It doesn't show the part of the query that matches. So if I want to allow for fuzzy searches, the highlight field won't match the original query and I won't be able to tell which part of the query matched. For example, a query of "very smaal" will return the size "Very Small", but the highlight field will show <em>very</em> <em>small</em>, not <em>very</em> <em>smaal</em>.
There are 2 types of queries in Elasticsearch, Match Query and Filtered Query. Match query matches your term in the documents and find all the relevant documents with a relevance score. For example when you search for term: "help fixing javascript problem" you are interested in all documents which contain one or more of the search term.
On the other hand, when you are using Filtered Query, a document is either a match or not match... there is no relevance score here... as an example, you want all the products built in year "2007"... here you need to use a filtered query. All the product built in 2007 have the same score and all other years are excluded from the result.
In my opinion, your problem should be dealt with Filter Query...
When using filter query, normally each filter has its own corresponding input in the UI, consider the following screen-shot which is from ebay:
If I have understood your requirement correctly, you want to include all those filters in a single search-box. In my opinion, this is nearly impossible to implement because you have no way to parse user input and decide which word corresponds to which filter...
If you want to go down the filter path, it's better to introduce corresponding UI fields for each filter...
If you want to stick to a single search box, then don't implement the filter functionality and stick to Elasticsearch Multi-match query... you can match the input term across multiple fields but you won't be able to filter out (exclude) result instead you get a relevance score.

Google Search Appliance sort by metadata content

I'm trying to refine the search results received by my application by including the sort parameter in my HTTP requests. I've combed through the documentation here, but I can't find exactly what I'm looking for.
I'm searching for DOC filetypes, and I am able to sort by date or sort by metadata, as in alphabetizing by title, author, etc. I can also filter by whether or not the title contains certain keywords. What I want to do is to sort by whether or not the title contains certain keywords (these documents appearing first in the results), but to still keep the other results.
For example, with keywords [winter, Christmas, holiday] I could do a descending sort by the sum of inmeta:title~winter, inmeta:title~Christmas, inmeta:title~holiday and the top result might be
Winter holidays other than Christmas
followed by documents with one or two of the keywords, followed by documents that meet the other search parameters but contain no keywords.
Is this possible in GSA?
I finally achieved what I was trying to do, so figured I'd post in case it helps anyone else.
As far as I know, it is impossible to create a query with this capability, but with Google's Custom Search API, you can create a search engine with the desired keywords in the context file (by editing the XML file directly or by adding keywords through the CSE console). Then you can formulate the query as usual, but perform the search on your personalized engine.
https://developers.google.com/custom-search/docs/ranking

Elasticsearch, how to make a query that matches strings with spaces

I am in learning process of ElasticSearch and having hard time matching certain cases.
For example I have product name: "SkyProdigy 130" and I am trying to write a query that will match this product name when someone types "sky prodigy". Also, another example for the manufacturer "Magpul" I would like to be able to match even if someone type "mag pul", etc.
I have managed to make this work with fuzzy query, but I am looking for a more organic way to achieve this through analyzers and correct mappings.
Can someone recommend the best approach for this case?

Web search algorithm with multiple words

I want to use search from database on my website, so I think about effective algorithm to use.
For example if I try to search "Hello my name is xxx" I want to see results:
Hello my name is John
Hello my name is Peter
Hello mr. xxx
His name is Peter
He is here
So I want to search all data from database with part of this text and sort result by number of matching words.
I made algorithm but I am pretty scared that it's so complicated and slow:
I split search text into words and use SQL select with multiple like or commands. Then I save this results into list. Then I count up numbers of matched words in each result and sort it by this count.
Problem is that when I will try to search long text.
Should I use better algorithm or should I learn somethink about thinks like Sphinx
For the first two results, a simple regex search should be able to retrieve results like that.
For the later ones, you might consider using an existing searching library thing, like Google Search Appliance, which can be used to search database information.

Making a Database query "Intelligent"?

I have the following requirement.
I have a table with a column that contains the city names. I am going to implement a search option by City.
But the user may not enter the city name correctly.
Examples :
The city "Matara" is sometimes spelled as "Mathara".
The city "Nuwara Eliya" is sometimes written as "Nuwaraeliya"
I can keep the consistency on the database column but I want to return the hits even the end user uses an alternative word.
What is the approach I need to use to implement this effectively?
You should probably implement a string distance check like Levenshtein distance
More approaches can be found here: How do you implement a "Did you mean"?
I think the above problem can be sufficiently solved by using Levenshtein Distance, PHP Similar Text or JaroWinkler Similarity. All the approaches provided me the sufficiently correct results.
Edit Distance Tool
You want something like a phonetic search.
Several algotithm exists. You can get an overview here
The idea is to add a column to you table with the phonetic equivalent to your city,
and perform the search against this (after having performed the same function for the searched term).
Some RDBMS such as Oracle possess a pre-implemented SOUNDEX function, that could allow you to perform the search without the added column.

Resources