I have the following requirement.
I have a table with a column that contains the city names. I am going to implement a search option by City.
But the user may not enter the city name correctly.
Examples :
The city "Matara" is sometimes spelled as "Mathara".
The city "Nuwara Eliya" is sometimes written as "Nuwaraeliya"
I can keep the database column consistent, but I want to return hits even when the end user enters an alternative spelling.
What is the approach I need to use to implement this effectively?
You should probably implement a string distance check such as Levenshtein distance.
More approaches can be found here: How do you implement a "Did you mean"?
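As a rough illustration of the string distance idea, here is a minimal sketch in Python. The function names `levenshtein` and `suggest_cities`, the sample city list, and the cutoff of 2 edits are assumptions made for this example, not something taken from the question.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest_cities(query, cities, max_distance=2):
    """Return stored city names within max_distance edits of the query, closest first."""
    q = query.strip().lower()
    scored = [(levenshtein(q, city.lower()), city) for city in cities]
    return [city for distance, city in sorted(scored) if distance <= max_distance]


# Hypothetical contents of the city column.
CITIES = ["Matara", "Nuwara Eliya", "Colombo"]
print(suggest_cities("Mathara", CITIES))
print(suggest_cities("Nuwaraeliya", CITIES))
```

The cutoff probably needs tuning: a fixed threshold of 2 may be too loose for very short city names and too strict for long ones.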
I think the above problem can be adequately solved using Levenshtein distance, PHP's similar_text, or Jaro-Winkler similarity. All of these approaches gave me sufficiently accurate results.
Edit Distance Tool
You want something like a phonetic search.
Several algorithms exist. You can get an overview here
The idea is to add a column to your table holding the phonetic equivalent of the city name,
and to perform the search against this column (after applying the same function to the search term).
Some RDBMSs, such as Oracle, provide a built-in SOUNDEX function, which could let you perform the search without the added column.
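To make the phonetic-column idea concrete, here is a small hand-written sketch of the classic American Soundex rules in Python. It is only an illustration: a database's built-in SOUNDEX may differ in edge cases, and the name `soundex` and the sample spellings are assumptions for this example.

```python
# Letter groups of classic American Soundex; vowels, H, W and Y have no code.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(name):
    """Return the 4-character Soundex code of name (ASCII letters only)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


# Store soundex(city) in the extra column, then compare it with soundex(user_input).
print(soundex("Matara"), soundex("Mathara"))
print(soundex("Nuwara Eliya"), soundex("Nuwaraeliya"))
```

Soundex was designed for English surnames, so it may group some city names too coarsely; combining it with an edit-distance check on the candidates is one possible refinement.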
Related
I'm facing a big problem with my Solr index.
My objects have a datetime field "Available_From" and a datetime field "Available_To".
We also have a "Ranking" field for the sorting.
I can search correctly with direct queries (e.g. give me all the items that are available at the moment), but when I do a regular search I cannot find a way to show the items that are "available NOW" first in the results, which are usually sorted by the "Ranking" field.
How can I do this? Am I forced to write some Java classes (the nearest thing I've found is here: https://medium.com/#devchaitu18/sorting-based-on-a-custom-function-in-solr-c94ddae99a12), or is there a way to do it with standard Solr queries?
Thanks in advance to everyone!
In your case you actually don't want sorting, since that indicates that you want one field to determine the returned sequence of documents.
Instead, use boosting - apply a very large boost to those that are available now, either through bq or boost, then apply a boost based on ranking. You'll have to tweak the weights given to each part based on how you want the search results to be presented.
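A minimal sketch of what such a request could look like, built in Python as a parameter dictionary. It assumes the edismax query parser and the field names from the question; the helper name `build_boosted_params` and the boost value of 1000 are placeholders to tune, not recommended settings.

```python
from urllib.parse import urlencode


def build_boosted_params(user_query, availability_boost=1000):
    """Build Solr request parameters that favour documents available right now."""
    # Documents whose availability window contains the current time.
    available_now = "(+Available_From:[* TO NOW] +Available_To:[NOW TO *])"
    return {
        "q": user_query,
        "defType": "edismax",
        # Additive boost query: a large weight for items available now.
        "bq": f"{available_now}^{availability_boost}",
        # Additive boost function based on the Ranking field.
        "bf": "field(Ranking)",
    }


params = build_boosted_params("red bicycle")
print(urlencode(params))  # query string to append to the /select handler URL
```

Because `bq` and `bf` add to the relevance score rather than replace it, the availability weight has to be large relative to typical text scores and Ranking values for available items to reliably come first.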
I'm looking for a convenient way to search for words related to a term. For example, if I search for the word "washer", I should also get results for related terms like "dryer", but with a lower score than the washer results. That is, the washer documents must appear first, followed by the dryer documents. How can I implement this functionality?
You need to build a synonym dictionary. Fortunately, we now have machine learning models, such as word2vec (a neural net), that can do this. You can try the open source gensim package for this.
The input to the model would be lots of text/info/articles that mention the words washer and dryer. Once you have trained on this, you can find the closest words related to "washer" and use them as a synonym-like dictionary.
At query time, look up this dictionary and expand the query, giving the synonyms a lower weight/boost than the actual term.
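The query-time expansion step could be sketched as follows. The `RELATED` dictionary is a hand-written stand-in for whatever a trained word2vec model would produce, the function name `expand_query` and the 0.3 weight are assumptions, and the output uses Lucene-style `term^boost` syntax, which the target search engine is assumed to accept.

```python
# Hypothetical related-terms dictionary; in practice it might be generated
# offline from a trained word2vec model.
RELATED = {"washer": ["dryer"]}


def expand_query(term, related=RELATED, synonym_boost=0.3):
    """Expand term with related words, weighting them below the original term."""
    clauses = [f"{term}^1.0"]
    clauses += [f"{word}^{synonym_boost}" for word in related.get(term, [])]
    return " OR ".join(clauses)


print(expand_query("washer"))
print(expand_query("toaster"))  # no related terms known, so only the term itself
```

With the original term boosted above its related words, documents matching "washer" should generally score higher than documents that only match "dryer", though the exact ordering still depends on the engine's scoring.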
For a matchmaking portal, we have a requirement that if a customer has viewed the complete profile details of a bride or groom, that profile must be excluded from the customer's further search results. Currently, along with other details, we store the IDs of the members who viewed a profile in a comma-separated field in that bride's or groom's record.
E.g., if A viewed B, then in B's record we add A to the saw_me field (comma separated).
While searching, let's say the ID of the member currently searching is 123456; then we fire a query like
Select * from profiledetails where (OTHER CON) AND 123456 not in saw_me;
The problem here is that the saw_me field value keeps growing without bound. Is there a better way to handle this requirement? Please guide.
If this is using Solr:
First, DON'T add the 'AND NOT ...' clauses to the main query in the q param; add them to fq. This has many benefits (for example, the fq will be cached).
Until the list of values reaches maybe thousands of entries, this approach is simple and should work fine.
Once the list becomes huge, it may be time to move to a post filter with a high cost (so it is evaluated last). This would look up the docs to remove in an external source (Redis, a DB...).
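The first point might look like this when the request parameters are built in Python. This is a sketch that assumes saw_me is indexed as a multi-valued field of member IDs; the helper name `build_search_params` and the sample main query are made up for the example.

```python
def build_search_params(main_query, member_id):
    """Keep the relevance query in q and the exclusion in a separate filter query."""
    return {
        "q": main_query,
        # Pure negative filter: drop profiles already viewed by this member.
        "fq": f"-saw_me:{member_id}",
    }


params = build_search_params("religion:any AND city:Chennai", 123456)
print(params)
```

Keeping the exclusion in `fq` means it does not influence scoring and can be served from the filter cache when the same member searches again.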
In my opinion, no matter how much the saw_me field grows, it will not make much difference in search time, because tokens are stored in an inverted index and doc_values are created at index time in a column-major fashion for efficient reads, with support for caching by the OS. ES handles these things for you efficiently.
I am familiar with how to sort GSA results on metadata.
I'm interested in sorting across multiple indexes.
For example, sort by Last Name, then by First Name.
So that Alice Smith appears before Bob Smith.
In SQL, this would be quite simple, equivalent to:
SELECT value FROM table ORDER BY last, first
Does GSA support this?
I've been playing with a few different syntaxes, but haven't found a way yet.
If it's only possible to sort on one index, how does Google sort within the set of equivalent results? E.g., how does GSA determine whether Alice or Bob appears first? I can't find any good explanation of this.
Sorry for posting this as an answer, but I can't comment on your question because my reputation is still too low.
I just want to know whether you found a way to solve this problem. Thank you!
From what I can tell, GSA does not support sorting on multiple dependent keys.
Instead, I've built an additional meta index that combines the two indexes I want to sort.
So, for example, I have index A for "First Name", index B for "Last Name", and index C which is the combination of both values into "Last Name"_"First Name".
This seems to be working well for me so far.
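The combined-index workaround can be illustrated outside GSA with a short Python sketch. The sample names and the helper `composite_key` are assumptions for this example; the point is only that sorting on one combined string reproduces a last-name-then-first-name order.

```python
# Hypothetical records as (first name, last name) pairs.
people = [("Bob", "Smith"), ("Alice", "Smith"), ("Carol", "Jones")]


def composite_key(first, last):
    """Combine both values into one sortable string: "Last Name"_"First Name"."""
    return f"{last}_{first}"


by_composite = sorted(people, key=lambda p: composite_key(p[0], p[1]))
# Reference ordering, equivalent to SQL's ORDER BY last, first.
by_tuple = sorted(people, key=lambda p: (p[1], p[0]))

print(by_composite)
```

One caveat worth checking with real data: the separator takes part in the comparison, so with a plain character-code sort an all-uppercase "SMITHSON" would come before "SMITH_..." because "S" sorts before "_". A separator that sorts below every character used in names may be a safer choice.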
There are use cases where I would really like to know which term was matched in which field by my search. With this information, I would like to show the user on my webpage which field caused the hit. I would also like to know which term was involved in the hit. In my case it is a database identifier, so I would take the matched term (an ID), fetch the corresponding database record, and display useful information to the user.
I currently know two ways: highlighting and the explain API. However, the first requires stored values, which seems unnecessary. The second is meant for debugging only and is rather expensive, so I wouldn't want it to run with every query.
I don't know of another way, which is confusing: the highlighting algorithms need the information I want anyway, so can't I just get it somehow?
On a related note, I would also be interested in the opposite case: Which term did not hit at all? This information would allow for features like "terms that didn't match your query" like Google does sometimes (where the respective words are shown in grey-strikeout).
Thanks for hints!