Fast full-text comparison of two databases - full-text-search

I have two databases with product data. The data in both is in third normal form, and the tables have the following fields:
id, FullName, AttributeName, AttributeValue
So, there are many rows (attributes) for every id (product).
For every product in the second DB, I need to find the relevant products in the first DB, along with a relevance value. The comparison should be structured (I need to compare both names and attributes).
The comparison of FullName and AttributeName (both are strings) between two products should use full-text search or some kind of fuzzy comparison (maybe embeddings).
I have tens of millions of products in the first database and millions in the second. Products can be added to or deleted from both databases. When a new product appears in the first database, we need to calculate its relevance to every product in the second database; when a new product appears in the second one, we can simply run a search query against all records in the first.
Because of the number of products, I am looking at full-text search engines like Sphinx, Elasticsearch, or Apache Solr.
But the question is: can I calculate the relevance of all products in the second DB to a new product in the first DB without "brute-force querying" (running a search with every product from the second DB as the query)? Maybe these engines offer some kind of "inverted relevance search", or maybe some other engine does.
I use Python in my system, so the engine should have an API I can use from Python.

More than a month late, but if you are still working on this, you could take a look at Manticore Percolate.
I am not sure I understand your question properly, though.
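To make the "inverted search" idea concrete, here is a toy sketch in plain Python. It is not Manticore's API; the class name, the method names, and the overlap-based score are all made up for illustration. The products from the second DB are stored as queries up front, and each new product from the first DB is then matched against all of them in one pass through a term index, instead of running one search per stored product.

```python
from collections import defaultdict


class ToyPercolator:
    """Toy illustration only: store queries first, match documents later."""

    def __init__(self):
        self.queries = {}                    # query_id -> set of query terms
        self.term_index = defaultdict(set)   # term -> ids of queries using it

    def add_query(self, query_id, text):
        """Register a stored query (e.g. a product name from the second DB)."""
        terms = set(text.lower().split())
        self.queries[query_id] = terms
        for term in terms:
            self.term_index[term].add(query_id)

    def percolate(self, document):
        """Return (query_id, score) pairs for stored queries sharing terms
        with the document. The score is an assumption made for this sketch:
        the fraction of the query's terms that occur in the document."""
        doc_terms = set(document.lower().split())
        overlap = defaultdict(int)
        for term in doc_terms:
            for query_id in self.term_index.get(term, ()):
                overlap[query_id] += 1
        scores = {qid: n / len(self.queries[qid]) for qid, n in overlap.items()}
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


percolator = ToyPercolator()
percolator.add_query("b1", "red cotton shirt")
percolator.add_query("b2", "blue denim jeans")
matches = percolator.percolate("Red shirt cotton large")
```

Only stored queries that share at least one term with the new product are touched, which is the point of the approach. Whether a real engine's percolate feature gives you a usable relevance value, rather than a plain match/no-match, is something to check in its documentation.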

Related

In Elasticsearch, how can I retrieve products grouped by the store that sells them?

I've got a bunch of stores, each of which sells several products, and those products have descriptions. I would like to build a search experience where the user can search for products by words in the description, and have a search result page where matching products are shown, grouped by the store that sells them. My question is:
How can I design an efficient Elasticsearch schema and query scheme that will let me query for products with the results grouped by store, with the guarantee that every store in the search results contains a complete list of items that match the query?
For instance, suppose I had the following data:
Store 1
Product 1a, description: "Peanut butter and jelly sandwich"
Product 1b, description: "Taco"
Product 1c, description: "Sandwich holder"
Store 2
Product 2a, description: "Burrito bowl"
Store 3
Product 3a, description: "Sandwich maker"
Product 3b, description: "Sandwich bread"
Product 3c, description: "Salad tongs"
In my overall application, I want a query for "sandwich" to return something like:
Store 1
product 1a
product 1c
Store 3
product 3a
product 3b
Whenever I show a store, I always want to show all hits for that store. In the domain I'm working in, there are lots of stores but each store only has a small number of products (max of around 10-20, with most stores only having 2 or 3).
I can see two ways to implement this, and both seem bad to me.
Approach #1
Index each product as a separate document. Then at query time, I could fetch every matching document and post-process them in Java to group them by store, and finally return that result. The problems I see with this approach are:
I can't use any kind of ranking, since I'm going to re-sort the results.
I also can't do any limiting; I have to fetch every single document, no matter how many there may be, since otherwise I can't guarantee that I have every product for a particular store. This will result in lots of wasted work.
Approach #2
Index each store as a separate document, with a nested field holding each product. At query time, I could retrieve stores where the product description nested field has a match on the search term. Then, once I have the stores I want to show, I'd have to run a separate query to fetch the matching products from those stores. The problems with this approach are:
I'm asking Elasticsearch to do more work than necessary; internally, it had to find everything I needed in the first query, but I'm issuing a second query anyway.
Issuing two related queries complicates the code and requires me to keep two queries in sync (e.g. I need to make sure that the documents matched in query 1 as subfields are the same documents that query 2 matches)
Can anyone more experienced with Elasticsearch than I am see a better option?
With Approach #2 I see two options:
Nested inner hits.
You could use top_hits with a reverse_nested aggregation. You search for the products in the query and group the docs by store in the aggregation. The top_hits aggregation returns regular search hits, meaning you get the children (products) along with the parent (store).
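For the first option, a request body along these lines might work. This is only a sketch: it assumes each store document has a nested field called products with a description subfield (names taken from Approach #2, not from a real mapping), and the function name and size defaults are made up. The nested query selects stores with at least one matching product, and inner_hits returns the matching products for each store in the same response, so no second query is needed.

```python
import json


def build_store_search(term, stores_per_page=10, max_products_per_store=20):
    """Build a search body: stores matched via their nested products,
    with the matching products returned as inner hits.
    Field names ("products", "products.description") are assumptions."""
    return {
        "size": stores_per_page,
        "query": {
            "nested": {
                "path": "products",
                "query": {"match": {"products.description": term}},
                # inner_hits returns only 3 hits per parent by default,
                # so raise it to cover the largest store.
                "inner_hits": {"size": max_products_per_store},
            }
        },
    }


body = build_store_search("sandwich")
print(json.dumps(body, indent=2))
```

The body would be sent to the search endpoint of the store index. Since stores have at most around 10-20 products, an inner_hits size of 20 should return every matching product per store.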

Good way to exclude records in SOLR or Elasticsearch

For a matchmaking portal, we have a requirement that if a customer has viewed the complete profile details of a bride or groom, we have to exclude that profile from that customer's further search results. Currently, along with other details, we store the ids of the members who viewed a profile in a comma-separated field on that bride's or groom's record.
E.g., if A viewed B, then in B's record we add A to the field saw_me (comma separated).
While searching, let's say the id of the member currently searching is 123456; then we fire a query like
Select * from profiledetails where (OTHER CON) AND 123456 not in saw_me;
The problem here is that the saw_me field value keeps growing without bound. Is there any better way to handle this requirement? Please guide.
If this is using Solr:
first, DON'T add the 'AND NOT ...' clauses to the main query in the q param; add them to fq. This has many benefits (for one, the fq will be cached).
Until the list of values reaches maybe the 1000s, this approach is simple and should work fine.
Once the list is huge, it may be time to move to a post filter with a high cost (so it is evaluated last). It would look up the docs to remove in an external source (Redis, a DB...).
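As a rough sketch of the first point, the request parameters could be built like this from Python. The function name and the example main query are made up, and the sketch assumes saw_me is indexed so that each viewer id is a separate term (for example a multi-valued field) rather than one comma-separated string.

```python
from urllib.parse import urlencode


def build_solr_params(main_query, viewer_id):
    """Keep the relevance query in q and put the exclusion in fq.
    Assumes each viewer id is its own indexed term in saw_me."""
    return urlencode({
        "q": main_query,                 # scored part of the search
        "fq": f"-saw_me:{viewer_id}",    # unscored, cacheable exclusion
        "wt": "json",
    })


params = build_solr_params("age:[25 TO 30]", 123456)
print(params)
```

The resulting string is what you would append to the core's select URL.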
In my opinion, no matter how much the saw_me field grows, it will not make much difference to search time, because tokens are stored in an inverted index, and doc_values are built at index time in a column-oriented layout for efficient reads, with the OS file-system cache helping as well. ES handles these things for you efficiently.

Sort by a different index's values

Given two indexes, I'm trying to sort the first based on values of the second.
For example, Index 1 ('Products') has fields id, name. Index 2 ('Prices') has fields id, price.
Struggling to figure out how to sort 'Products' by the 'Prices'.price, assuming the ids match. Reason for this quest is that hypothetically the 'Products' index becomes very large (with duplicate ids), and updating all documents becomes expensive.
Elasticsearch is a document-based store rather than a column-based store. What you're looking for is a way to JOIN the two indices, but this is not supported in Elasticsearch. The 'Elasticsearch way' of storing these documents is to have one index that contains all the relevant data. If you're worried about update procedures taking very long, look into creating the index behind an alias. When you need to do a major update, write it to a new index, and only when you're done switch the alias target to the new index. This lets you update your data seamlessly.
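The alias switch itself is a single request to the _aliases endpoint. A minimal sketch of its body, with made-up alias and index names:

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases: both actions are applied in one request,
    so searches against the alias never see a gap.
    All names passed in here are hypothetical."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = build_alias_swap("products", "products_v1", "products_v2")
print(json.dumps(swap, indent=2))
```

The application always queries the alias (products here), so it does not need to know which physical index is currently behind it.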

Efficient way for sorting items for different parameters?

Suppose you have millions of items (say, search results) and different parameters for sorting them (as on eCommerce sites). We will be showing the items using pagination.
Let us say it can be sorted by date, popularity and relevance and results are paginated. How would you implement this functionality? Generally I would create different compare functions for parameters and get results accordingly.
Is there any other efficient way to provide this kind of functionality instead of sorting the search results every time? Also, do we generally run an SQL query every time with the relevant order parameter, or should we sort the result of the previous query to save the time of searching again?
"Is there any other efficient way to provide this kind of functionality instead of sorting the search results every time?"
I would say you do not need to sort every time yourself; execute an SQL query with the appropriate ORDER BY parameter, paginate it, and show the result to the user.
"Also, do we generally run an SQL query every time with the relevant order parameter, or should we sort the result of the previous query to save the time of searching again?"
You definitely need to issue a new SQL query, because the first page under a new order parameter can contain a completely different set of rows from the previous one.
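A small runnable sketch of that approach, using sqlite3 so it is self-contained. The table, the column names, and the sample rows are invented for the example. The sort key is mapped through a whitelist because ORDER BY columns cannot be passed as bound parameters.

```python
import sqlite3

# Whitelist of allowed sort orders (column names are made up for this sketch).
SORT_COLUMNS = {
    "date": "created_at DESC",
    "popularity": "popularity DESC",
    "relevance": "relevance DESC",
}


def fetch_page(conn, sort_key, page, page_size=2):
    """Let the database sort and paginate; id is a tie-breaker so that
    pages stay stable between requests."""
    order_by = SORT_COLUMNS[sort_key]  # unknown keys raise KeyError
    sql = f"SELECT id FROM items ORDER BY {order_by}, id LIMIT ? OFFSET ?"
    rows = conn.execute(sql, (page_size, page * page_size)).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, created_at TEXT, "
    "popularity INTEGER, relevance REAL)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01", 50, 0.2),
        (2, "2024-03-01", 10, 0.9),
        (3, "2024-02-01", 30, 0.5),
        (4, "2024-04-01", 40, 0.1),
    ],
)

first_by_date = fetch_page(conn, "date", 0)              # [4, 2]
first_by_popularity = fetch_page(conn, "popularity", 0)  # [1, 4]
```

The first page differs completely between the two orderings, which is why re-sorting a previously fetched page is not enough. With indexes on the sort columns, the database can usually serve each page without sorting the whole table.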

Lucene.NET: Query or Filter?

It is my understanding that documents are found based on a query, and that result is then narrowed down by the filter.
The Query is the only thing that will affect the score/relevance of a document.
Would there be any performance (caching) improvements if I query on the things that matter for relevance and filter on the things that don't?
Here is my situation. I have a lot of products, and the website will often search for products by category or manufacturer. I was thinking about using queries for that as that will bring the products down to a smaller subset which can be cached. I can then filter my results by product specifications. Should I use filters for specifications? That way we can filter based on an already cached (by lucene) subset of products (category or manufacturer).
Using filters does not affect the returned score, whereas additional terms in a query do. You should use filters, for example, when a user picks a certain category from a list of available categories shown as facets:
Category : Electricals
Query Terms : DSLR Camera
The resulting scores (relevancy) are based on the query terms, not on the hit on the category.
The difference between a filter and a query is mostly that a filter is exact. If you filter on brand=..., then you will only get that exact brand. If you query on it, you will get the brand and possibly other results that also match your query.
So the question is, do you want an exact filter, or is it just for relevance?
Filtering provides a mechanism to further restrict the results of a query and provide a possible performance gain if the same query is run multiple times.
We mostly use filters for security - this would provide performance gains as results of the query are cached.
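The distinction can be shown with a toy model in plain Python. This is not Lucene.NET code; the scoring (a simple count of query-term occurrences) and the document fields are made up for the illustration. The filter only decides whether a document is eligible, while the query terms alone determine the score.

```python
def search(docs, query_terms, filters=None):
    """Toy search: filters are exact yes/no checks that never change the
    score; query terms produce the score (occurrence count, an assumption)."""
    filters = filters or {}
    results = []
    for doc in docs:
        # Filter step: exact match on each filtered field, no scoring.
        if any(doc.get(field) != value for field, value in filters.items()):
            continue
        # Query step: score by how often the query terms occur in the name.
        words = doc["name"].lower().split()
        score = sum(words.count(term.lower()) for term in query_terms)
        if score > 0:
            results.append((doc["id"], score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


docs = [
    {"id": 1, "name": "Canon DSLR camera", "category": "Electricals"},
    {"id": 2, "name": "Camera bag", "category": "Accessories"},
    {"id": 3, "name": "Compact camera", "category": "Electricals"},
]

unfiltered = search(docs, ["DSLR", "camera"])
filtered = search(docs, ["DSLR", "camera"], {"category": "Electricals"})
```

Here the filter removes the camera bag, but the scores of the remaining documents are the same as in the unfiltered search.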
