Web search algorithm with multiple words

I want to add database-backed search to my website, so I'm looking for an efficient algorithm to use.
For example, if I search for "Hello my name is xxx", I want to see these results:
Hello my name is John
Hello my name is Peter
Hello mr. xxx
His name is Peter
He is here
So I want to find every database row that contains part of this text and sort the results by the number of matching words.
I wrote an algorithm, but I'm worried that it's too complicated and slow:
I split the search text into words and run an SQL SELECT with multiple LIKE conditions combined with OR. Then I save the results into a list, count the matched words in each result, and sort by that count.
The problem comes when I search for a long text.
Should I use a better algorithm, or should I learn something about tools like Sphinx?

For the first two results, a simple regex search should be able to retrieve them.
For the latter ones, you might consider using an existing search product, such as Google Search Appliance, which can be used to search database information.
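The approach described in the question (split the query into words, count how many of them each row contains, sort by that count) can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows are a hard-coded list standing in for the database, the function names are made up for this example, and matching is on whole words rather than SQL LIKE substrings.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def rank_by_matching_words(query, rows):
    """Return (score, row) pairs sorted by number of matching words, best first.

    Rows that share no word with the query are dropped.
    """
    query_words = tokenize(query)
    scored = []
    for row in rows:
        score = len(query_words & tokenize(row))
        if score > 0:
            scored.append((score, row))
    # sort() is stable, so rows with equal scores keep their input order
    scored.sort(key=lambda pair: -pair[0])
    return scored


# Hypothetical sample data standing in for the database rows
rows = [
    "He is here",
    "His name is Peter",
    "Hello mr. xxx",
    "Hello my name is John",
    "Hello my name is Peter",
    "Unrelated text",
]
ranked = rank_by_matching_words("Hello my name is xxx", rows)
for score, row in ranked:
    print(score, row)
```

For a large table this per-row loop is exactly the part a full-text engine such as Sphinx replaces with an inverted index, so the sketch is mainly useful for checking what ranking you expect.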

Related

Related search words in Elasticsearch

I'm looking for a convenient way to search for words related to a term. For example, if I search for the word "washer", I should also get results for related terms like "dryer", but with a lower score than the washer results. That is, the washer documents must appear first, followed by the dryer documents. How can I implement this?

You need to build a synonym dictionary. Fortunately, there are now machine learning models, such as word2vec (a neural network), that can do this. You can try the open-source gensim package for this.
The input to the model would be a large amount of text (info, articles) that contains the words washer and dryer. Once you train on this, you can find the words closest to "washer" and use them as a synonym-like dictionary.
At query time, look up this dictionary and expand the query, giving the synonyms a lower weight/boost than the actual term.
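As a rough sketch of that last step, the query expansion could build an Elasticsearch bool query in which the original term gets a higher boost than its synonyms. The field name "title", the boost values, and the hand-written synonym dictionary are all assumptions for illustration; in practice the dictionary would come from the trained model.

```python
def expand_query(term, synonyms, field="title", term_boost=2.0, synonym_boost=0.5):
    """Build a query body where synonyms score lower than the original term.

    `synonyms` maps a term to a list of related words (hypothetical data here).
    """
    should = [{"match": {field: {"query": term, "boost": term_boost}}}]
    for related in synonyms.get(term, []):
        should.append({"match": {field: {"query": related, "boost": synonym_boost}}})
    # At least one clause must match; documents matching the original term
    # receive the larger boost and therefore tend to rank first.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# Hand-written stand-in for a dictionary derived from a word2vec model
synonym_dict = {"washer": ["dryer"]}
body = expand_query("washer", synonym_dict)
print(body)
```

The resulting dictionary is what you would send as the search request body; tuning the two boost values controls how far below the washer documents the dryer documents land.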

Google Search Appliance sort by metadata content

I'm trying to refine the search results received by my application by including the sort parameter in my HTTP requests. I've combed through the documentation here, but I can't find exactly what I'm looking for.
I'm searching for DOC filetypes, and I am able to sort by date or sort by metadata, as in alphabetizing by title, author, etc. I can also filter by whether or not the title contains certain keywords. What I want to do is to sort by whether or not the title contains certain keywords (these documents appearing first in the results), but to still keep the other results.
For example, with keywords [winter, Christmas, holiday] I could do a descending sort by the sum of inmeta:title~winter, inmeta:title~Christmas, inmeta:title~holiday and the top result might be
Winter holidays other than Christmas
followed by documents with one or two of the keywords, followed by documents that meet the other search parameters but contain no keywords.
Is this possible in GSA?

I finally achieved what I was trying to do, so figured I'd post in case it helps anyone else.
As far as I know, it is impossible to create a query with this capability, but with Google's Custom Search API, you can create a search engine with the desired keywords in the context file (by editing the XML file directly or by adding keywords through the CSE console). Then you can formulate the query as usual, but perform the search on your personalized engine.
https://developers.google.com/custom-search/docs/ranking

GSA sorting over multiple metadata indexes

I am familiar with how to sort GSA results on metadata.
I'm interested in sorting across multiple indexes.
For example, sort by Last Name, then by First Name.
So that Alice Smith appears before Bob Smith.
In SQL, this would be quite simple, equivalent to:
SELECT value FROM table ORDER BY last, first
Does GSA support this?
I've been playing with a few different syntaxes, but haven't found a way yet.
If it's only possible to sort on one index, how does Google sort within the set of equivalent results? E.g., how does GSA determine whether Alice or Bob appears first? I can't find any good explanation of this.

Sorry for posting this as an answer, but I can't comment on your question because my reputation is still too low.
I just want to know whether you found a way to solve this problem. Thank you!

From what I can tell, GSA does not support multiple dependent sort orders.
Instead, I've built an additional meta index that combines the two indexes I want to sort.
So, for example, I have index A for "First Name", index B for "Last Name", and index C which is the combination of both values into "Last Name"_"First Name".
This seems to be working well for me so far.
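To illustrate the idea, here is a small sketch of building such a combined value and sorting on it. The function name, the sample people, and the "_" separator are assumptions for the example; the real combined value would be written into the metadata at indexing time.

```python
def composite_key(last_name, first_name, separator="_"):
    """Combine two values into one sortable string, e.g. 'Smith_Alice'."""
    return f"{last_name}{separator}{first_name}"


# Hypothetical records: (first name, last name)
people = [("Bob", "Smith"), ("Alice", "Smith"), ("Carol", "Jones")]

# Sorting on the single combined key ...
by_composite = sorted(people, key=lambda p: composite_key(p[1], p[0]))
# ... gives the same order here as a true two-level sort (last name, then first name)
by_tuple = sorted(people, key=lambda p: (p[1], p[0]))

print(by_composite)
```

One thing to watch: a plain string comparison of the combined value only mirrors a two-level sort as long as the separator orders sensibly against the characters in the names, so it is worth checking a few edge cases in your own data.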

Sphinx reverse search - when new item is added, execute searches on existing stored keywords

I have an app where people can list stuff to sell/swap/give away, with 200-character descriptions. Let's call them sellers.
Other users can search for things - let's call them buyers.
I have a system set up using Django, MySQL and Sphinx for text search.
Let's say a buyer is looking for "t-shirts". They don't get any results they want. I want the app to give the buyer the option to check a box to say "Tell me if something comes up".
Then when a seller lists a "Quicksilver t-shirt", this would trigger a sort of reverse search on all saved searches to notify those buyers that a new item matching their query has been listed.
Obviously I could trigger Sphinx searches on every saved search every time any new item is listed (in a loop) to look for matches - but this would be insane and intensive. This is the effect I want to achieve in a sane way - how can I do it?

You literally build a reverse index!
Store the 'searches' in the database, and build an index on them.
So 't-shirts' would be a document in this index.
Then when a new product is submitted, you run a query against this index. Use 'Quorum' syntax or even match-any, so that a saved search is returned even if it matches only one keyword.
So in your example, the query would be "Quicksilver t-shirt"/1 which means match Quicksilver OR t-shirt. But the same holds with much longer titles, or even the whole description.
The result of that query would be a list of (single word*) original searches that matched. Note this also assumes you have your index set up to treat - as a word character.
*Note it's slightly more complicated if you allow more complex queries: multiple keywords, negations, OR groups in brackets, phrases, etc. In that case the reverse search just gives you POTENTIAL matches, so you need to confirm that each saved search really matches. That is still a number of queries, but you don't need to run all of them.
btw, I think the technical term for these 'reverse' searches is Prospective Search
http://en.wikipedia.org/wiki/Prospective_search
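The two-step idea above (find candidate saved searches through an index, then confirm each candidate) can be sketched in plain Python without Sphinx. Everything here is an assumption for illustration: the class and method names are made up, a saved search counts as matching only when all of its words appear in the item, and there is no stemming, so "t-shirts" would not match "t-shirt".

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case and split into words, treating '-' as a word character."""
    return set(re.findall(r"[\w-]+", text.lower()))


class SavedSearchIndex:
    """Inverted index from word to the saved searches that contain it."""

    def __init__(self):
        self._searches = {}                # search_id -> set of query words
        self._by_word = defaultdict(set)   # word -> set of search_ids

    def add(self, search_id, query):
        words = tokenize(query)
        self._searches[search_id] = words
        for word in words:
            self._by_word[word].add(search_id)

    def matches(self, item_text):
        """Return ids of saved searches whose words all occur in the item."""
        item_words = tokenize(item_text)
        # Step 1: any shared word makes a saved search a POTENTIAL match
        candidates = set()
        for word in item_words:
            candidates |= self._by_word.get(word, set())
        # Step 2: confirm each candidate instead of re-running every saved search
        return sorted(sid for sid in candidates if self._searches[sid] <= item_words)


index = SavedSearchIndex()
index.add(1, "t-shirt")
index.add(2, "quicksilver hoodie")
index.add(3, "Quicksilver t-shirt")
index.add(4, "bike")
notify = index.matches("Quicksilver t-shirt, size M")
print(notify)
```

Only the saved searches that share at least one word with the new item are ever examined, which is the saving compared with looping over every stored search.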

Searching a column with comma separated values in Oracle

I am using Oracle 11g and Oracle Text for a web search engine.
I have now created and text-indexed a CLOB column Keywords which contains space-separated words. This allowed me to extend the search, as Oracle Text will return the rows that have one or more keywords stored in that column. The contents of the column are hidden from the user and are used only to "extend" the search. This is working as intended.
But now I need to support multiple words or even complete sentences. With the current configuration, Oracle Text will search only for individual keywords. How do I need to store the phrases and configure Oracle Text so that it will search for whole phrases (exact match is preferred, but fuzzy matching is fine too)?
Column content example of two rows (semicolon-separated values):
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
I found a similar question: Searching a column with comma separated values, except that I need a solution for Oracle 11g with its free-text search functionality.
Possible solutions:
1st solution: I was thinking of redesigning the DB as follows. I'd make a new table Keywords(pkID NUMBER, nonUniqueID NUMBER, singlePhrase VARCHAR2(100 BYTE)). And I'd change the previous column Keywords to KeywordNonUniqueID, which would hold the ID (instead of a list of values). At search time I'd INNER JOIN with the new Keywords table. The problem with this solution is that I'll get multiple rows that contain the same data except for the phrase. I assume this will destroy the ranking?
2nd solution: Is it possible to store phrases as XML in the original Keywords column, and somehow tell Oracle Text to search within the XML?
3rd solution: ?
Note that, generally, there won't be a lot of phrases (less than 100), nor will they be long (a single phrase will have up to 5 words).
Also note that I am currently using CONTAINS, and a few of its operators, for my full-text searching needs.
EDIT: This discussion, https://forums.oracle.com/forums/thread.jspa?messageID=10791361, almost solves my problem, but it also matches the individual words, not just the whole phrase (exact matching).

Oracle supports searching of phrases by default.
The docs say:
4.1.4.1 CONTAINS Phrase Queries
If multiple words are contained in a query expression, separated only
by blank spaces (no operators), the string of words is considered a
phrase and Oracle Text searches for the entire string during a query.
For example, to find all documents that contain the phrase
international law, enter your query with the phrase international law.
Did I answer your question or misunderstand you?
P.S. It seems to me that the solution is to convert
"hello, hello; is there anybody out there?; nope;"
"just the; basic facts;"
to
"hello, hello aa is there anybody out there? aa nope aa"
"just the aa basic facts aa"
and search with CONTAINS for the phrase "is there anybody out there? aa"
