Sphinx reverse search - when new item is added, execute searches on existing stored keywords

I have an app where people can list stuff to sell/swap/give away, with 200-character descriptions. Let's call them sellers.
Other users can search for things - let's call them buyers.
I have a system set up using Django, MySQL and Sphinx for text search.
Let's say a buyer is looking for "t-shirts". They don't get any results they want. I want the app to give the buyer the option to check a box to say "Tell me if something comes up".
Then when a seller lists a "Quicksilver t-shirt", this would trigger a sort of reverse search on all saved searches to notify those buyers that a new item matching their query has been listed.
Obviously I could trigger Sphinx searches on every saved search every time any new item is listed (in a loop) to look for matches - but this would be insane and intensive. This is the effect I want to achieve in a sane way - how can I do it?

You literally build a reverse index!
Store the 'searches' in the database, and build an index on them.
So 't-shirts' would be a document in this index.
Then when a new product is submitted, you run a query against this index. Use 'Quorum' syntax or even match-any, so that a saved search is returned when just one of the query's keywords matches it.
So in your example, the query would be "Quicksilver t-shirt"/1, which means match Quicksilver OR t-shirt. The same holds with much longer titles, or even the whole description.
The result of that query would be a list of the (single-word*) original searches that matched. Note this also assumes you have your index set up to treat - as a word character.
*Note it's slightly more complicated if you allow more complex queries: multiple keywords, negations, OR, brackets, phrases etc. In that case the reverse search just gives you POTENTIAL matches, so you need to confirm that each saved search really matches the new item. That is still a number of queries, but you don't need to run every saved search.
btw, I think the technical term for these 'reverse' searches is Prospective Search
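To make the idea concrete, here is a minimal in-memory sketch in Python rather than Sphinx itself. The class `SavedSearchIndex`, its methods, and the crude plural stripping in `normalize` are all hypothetical stand-ins (a real setup would rely on the Sphinx index and its morphology settings). It shows the two steps: a quorum-of-1 lookup for candidates, then a confirmation pass.

```python
from collections import defaultdict


def normalize(word):
    # Crude stand-in for real stemming/morphology: lowercase and drop a
    # trailing "s" so "t-shirts" and "t-shirt" collapse to the same term.
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def tokenize(text):
    # Split on whitespace only, so "-" stays inside a word (the answer
    # assumes the index treats "-" as a word character).
    return [normalize(w) for w in text.split()]


class SavedSearchIndex:
    """Toy inverted index whose 'documents' are the saved searches."""

    def __init__(self):
        self.searches = {}                # search_id -> list of keywords
        self.postings = defaultdict(set)  # keyword -> ids of searches using it

    def add(self, search_id, query):
        words = tokenize(query)
        self.searches[search_id] = words
        for w in words:
            self.postings[w].add(search_id)

    def candidates(self, item_text):
        # Same spirit as a quorum-of-1 query: one shared keyword is enough.
        found = set()
        for w in set(tokenize(item_text)):
            found |= self.postings.get(w, set())
        return found

    def matches(self, item_text):
        # Confirmation step: keep searches whose every keyword is in the item.
        item_words = set(tokenize(item_text))
        return sorted(
            sid for sid in self.candidates(item_text)
            if all(w in item_words for w in self.searches[sid])
        )


index = SavedSearchIndex()
index.add(1, "t-shirts")
index.add(2, "quicksilver hoodie")
index.add(3, "bicycle")

new_item = "Quicksilver t-shirt"
potential = index.candidates(new_item)  # searches 1 and 2 share a keyword
confirmed = index.matches(new_item)     # only search 1 is fully satisfied
```

Only the candidate searches are re-checked, so the per-item cost depends on how many saved searches share a word with the new listing, not on the total number of saved searches.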


Google Search Appliance sort by metadata content

I'm trying to refine the search results received by my application by including the sort parameter in my HTTP requests. I've combed through the documentation here, but I can't find exactly what I'm looking for.
I'm searching for DOC filetypes, and I am able to sort by date or sort by metadata, as in alphabetizing by title, author, etc. I can also filter by whether or not the title contains certain keywords. What I want to do is to sort by whether or not the title contains certain keywords (these documents appearing first in the results), but to still keep the other results.
For example, with keywords [winter, Christmas, holiday] I could do a descending sort by the sum of inmeta:title~winter, inmeta:title~Christmas, inmeta:title~holiday and the top result might be
Winter holidays other than Christmas
followed by documents with one or two of the keywords, followed by documents that meet the other search parameters but contain no keywords.
Is this possible in GSA?
I finally achieved what I was trying to do, so figured I'd post in case it helps anyone else.
As far as I know, it is impossible to create a query with this capability, but with Google's Custom Search API, you can create a search engine with the desired keywords in the context file (by editing the XML file directly or by adding keywords through the CSE console). Then you can formulate the query as usual, but perform the search on your personalized engine.

Good way to exclude records in SOLR or Elasticsearch

For a matchmaking portal, we have a requirement that if a customer has viewed the complete profile details of a bride or groom, that profile must be excluded from the customer's further search results. Currently, along with other details, we store the ids of the members who viewed a profile in a comma-separated field on that bride's or groom's record.
Eg., if A viewed B, then in B's record under the field saw_me we will add A (comma separated).
While searching, let's say the id of the member currently searching is 123456; then we fire a query like
Select * from profiledetails where (OTHER CON) AND 123456 not in saw_me;
The problem here is that the saw_me field value keeps growing without limit. Is there any better way to handle this requirement? Please guide.
If this is using Solr:
First, DON'T add the 'AND NOT ...' clauses to the main query in the q param; add them to fq. This has several benefits (for one, the fq will be cached).
Until the list of values reaches maybe thousands of entries, this approach is simple and should work fine.
Once the list gets huge, it may be time to move to a post filter with a high cost (so it is applied last). This would look up the docs to remove in an external source (Redis, a database, ...).
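As a rough illustration of the q/fq split, the request parameters could be built like this in Python. The helper name `build_solr_params` and the main query `city:Pune` are made up for the example, and it assumes saw_me is indexed so that each viewer id is its own term (for instance a multiValued field) rather than one opaque comma-separated string.

```python
from urllib.parse import urlencode, parse_qs


def build_solr_params(main_query, viewer_id, field="saw_me"):
    # Keep the relevance query in q and put the exclusion in fq, so Solr
    # can cache the filter separately from the main query.
    return [
        ("q", main_query),
        ("fq", f"-{field}:{viewer_id}"),  # exclude profiles this member has seen
    ]


# "city:Pune" is a hypothetical stand-in for the other search conditions.
params = build_solr_params("city:Pune", 123456)
query_string = urlencode(params)  # ready to append to the select URL
decoded = parse_qs(query_string)
```

Since the filter depends on the searching member's id, each member gets their own cached filter entry, which is worth keeping in mind when sizing the filter cache.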
In my opinion, no matter how much the saw_me field grows, it will not make much difference to search time, because tokens are stored in an inverted index and doc_values are created at index time in a column-oriented layout for efficient reads, with caching support from the OS. ES handles these things for you efficiently.

ElasticSearch: is it possible to highlight words in the query rather than the results

We use ElasticSearch in a reverse manner from what I usually see. We store lots of small documents, usually 1 or 2 words, for example, Job Titles like "software engineering", "car mechanics", "architect", etc.
Then we query with a longer string, for example a 1000 word Job Spec. This way we get all Job Titles present in the text of the Job Spec.
It works well. But I was wondering whether I could get ElasticSearch to highlight the matching Job Titles in the Job Spec, i.e. highlight the results in the query. I have tried the highlight keyword, but it doesn't highlight the query text, it highlights the results. I'm not sure how to get the query to be returned in the ElasticSearch response, let alone whether it can be highlighted.
You might wonder why I need ElasticSearch to highlight the query: can't I just pick out all the results from the text and highlight them myself? Yes I can, but there are various things that make it hard, such as stemming and stopword removal. For example, "jquery" is stemmed to "jqueri" when ElasticSearch tokenises it, so it's found as a result, but if I want to highlight it myself, I have to unstem it so it matches the original text. ElasticSearch also removes symbols, so terms & conditions would become terms conditions, which is problematic if I want to highlight it manually, as I have to add back the "&" symbol. There are a hundred other problem cases, hence the question about whether ElasticSearch can do it for me.
I'm quite sure highlighting the query string isn't possible - only highlighting parts of documents in an index.
What you might try is indexing the query string itself in its own index and then using the results of the first query as the query terms for a second query against the query string (in the second index). You could then have highlighting on the query string. You'll have to make an extra request to ES each time, but I think it'll get you what you want.
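A sketch of what the body of that second request might look like, built as a plain Python dict. The function name, the field name `text`, and the idea of filtering by the id of the freshly indexed Job Spec are assumptions for illustration; `match_phrase`, `ids`, and `highlight` are standard query DSL pieces, but the exact shape you need depends on your mapping and ES version.

```python
def build_highlight_request(matched_titles, spec_doc_id, field="text"):
    # matched_titles: Job Titles returned by the first (reverse) query.
    # spec_doc_id: id under which the Job Spec was indexed in the second index.
    return {
        "query": {
            "bool": {
                # Restrict the search to the one Job Spec document.
                "filter": [{"ids": {"values": [spec_doc_id]}}],
                # Any of the matched titles may appear in the text.
                "should": [{"match_phrase": {field: t}} for t in matched_titles],
                "minimum_should_match": 1,
            }
        },
        "highlight": {
            # number_of_fragments 0 asks for the whole field, not snippets.
            "fields": {field: {"number_of_fragments": 0}}
        },
    }


body = build_highlight_request(["software engineering", "architect"], "spec-1")
```

Because the Job Spec goes through the same analyzer as the Job Titles, the stemming and symbol handling are applied consistently on both sides, which is the part that is hard to reproduce by hand.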

ElasticSearch: A way to know which term hit in which field?

There are use cases where I would really like to know which term was matched in which field by my search. With this information I could tell the user on my webpage which field caused the hit. I would also like to know the term that took part in the hit. In my case it is a database identifier, so I would take the matched term (an ID), get the respective database record and display useful information to the user.
I currently know two ways: highlighting and the explain API. However, the first requires stored values, which seems unnecessary. The second is meant for debugging only and is rather expensive, so I wouldn't want it to run with every query.
I don't know of another way, which is confusing: the highlighting algorithms need the information I want anyway, so can't I just get it somehow?
On a related note, I would also be interested in the opposite case: Which term did not hit at all? This information would allow for features like "terms that didn't match your query" like Google does sometimes (where the respective words are shown in grey-strikeout).
Thanks for hints!

Exact phrase search using lucene without increasing number of fields

For a phrase search, we want to bring up results only if there's an exact match (without ignoring stopwords). If it's a non-phrase search, we are fine displaying results even if the root form of the word matches etc.
We currently pass our data through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter. Because of this, when a user searches for "password management", the search brings up results containing "password manager".
If I remove the stem filter, then I will not be able to match the root form of the word for non-phrase queries. I was wondering whether I should index the same data in two fields of the document.
I have asked the same question at "Different indexing and search strategies on same field without doubling index size?". However, folks at the office are not happy about indexing the same data in two fields (we currently have around 20 text fields in a Lucene document). Is there any way to support both of the cases I listed above using TokenFilters?
Say, for a StopFilter, make changes so that it emits both the input token and ? (for ignored word) with same position increments. Similarly for StemFilter, it emits both the input token and stemmed token with same position increments. Basically input and output tokens (even ignored ones) have same positions.
Is it safe to go ahead with this approach? Has anyone else faced the requirements listed here? Are there any Filters readily available which do something similar to what I mentioned in my approach?
I don't understand what you mean by "input and output tokens." Are you storing the data twice - once as stemmed and once non-stemmed?
If you aren't storing it twice, I don't think your method will work. Suppose the stored word is jumping and they search for jumped. Your query parser can emit jump and jumped but it still won't match jumping unless you have a value stored as jump.
And if you're going to store the value once as stemmed and once as non-stemmed, then why not just store it in two fields? Then you won't have to deal with weird tokenizer changes.
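To show why two fields make this easy, here is a toy Python model, not Lucene code. `toy_stem` is a deliberately crude stand-in for PorterStemFilter, the stopword list is made up, and `search` is a hypothetical helper: phrase queries go to the exact tokens, everything else to the stemmed tokens.

```python
STOPWORDS = {"a", "the", "of", "for"}  # made-up list for the example


def toy_stem(word):
    # Crude suffix stripper; only a stand-in for a real Porter stemmer.
    for suffix in ("ement", "er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def exact_tokens(text):
    # "Exact" field: lowercase only, stopwords kept, no stemming.
    return text.lower().split()


def stemmed_tokens(text):
    # "Stemmed" field: stopwords removed, remaining words stemmed.
    return [toy_stem(w) for w in exact_tokens(text) if w not in STOPWORDS]


def contains_phrase(tokens, phrase_tokens):
    # True if phrase_tokens occur as a contiguous run inside tokens.
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def search(doc_text, query, is_phrase):
    if is_phrase:
        # Phrase search: consult the exact field only.
        return contains_phrase(exact_tokens(doc_text), exact_tokens(query))
    # Non-phrase search: any stemmed term in common counts as a hit.
    doc_terms = set(stemmed_tokens(doc_text))
    return any(t in doc_terms for t in stemmed_tokens(query))


doc = "A password manager for teams"
phrase_hit = search(doc, "password management", is_phrase=True)
loose_hit = search(doc, "password management", is_phrase=False)
```

In this model the phrase query misses "password manager" while the non-phrase query still finds it, which is the behaviour the question asks for; the price is the second copy of the tokens.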
