Performance of Salesforce duplicate matching with fuzzy match criteria - performance

We're using Salesforce for Master Data Management (MDM). We're intending to have a custom Apex REST API to check for duplicate account (PersonAccount) objects using the findDuplicates method from the Datacloud namespace.
If we have a Duplicate Rule that has multiple Matching Rules with each Matching Rule having one or more matching criteria (some of which will probably be fuzzy matching criteria) will there be performance issues if we're trying to find duplicates over 1 million records?
For the sake of this discussion, let's say there's a Duplicate Rule with 5 Matching Rules, each Matching Rule has 2 matching criteria, and in 3 of the Matching Rules one of the criteria is a fuzzy match, e.g.
(Exact Mobile AND Exact Postcode) OR
(Fuzzy Email AND Exact Postcode) OR
(Fuzzy Surname AND Exact DOB) OR
(Exact Mobile AND Fuzzy Surname) OR
(Exact DOB AND Exact Mobile)
Will there be performance issues if we're trying to find duplicates over 1 million records?
I'm hoping this will take microseconds rather than something like 20 seconds, because the call to the custom Apex REST API to check for duplicates will block the UI. I suspect that fuzzy matching, which might use distance algorithms, would be hard or impossible to index and so probably wouldn't perform well.
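To make the indexing concern concrete, here is a minimal Python sketch of the general "blocking" technique: an exact-match field acts as a hash index that narrows the candidates, and the edit-distance comparison runs only inside that small bucket. This is an illustration only and says nothing about how Salesforce's matching engine works internally; the record fields, the `max_distance` threshold and the function names are all hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def build_block_index(records, key):
    """Group records by an exact-match field so lookups are a hash probe."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return index


def fuzzy_surname_exact_dob(candidate, dob_index, max_distance=1):
    """Mimics a rule like (Fuzzy Surname AND Exact DOB).

    Only records sharing the candidate's DOB are compared, so the expensive
    distance computation never touches the whole data set.
    """
    bucket = dob_index.get(candidate["dob"], [])
    wanted = candidate["surname"].lower()
    return [r for r in bucket
            if levenshtein(r["surname"].lower(), wanted) <= max_distance]


# Hypothetical sample data
records = [
    {"surname": "Smith", "dob": "1980-01-01"},
    {"surname": "Smyth", "dob": "1980-01-01"},
    {"surname": "Jones", "dob": "1980-01-01"},
    {"surname": "Smith", "dob": "1990-05-05"},
]
dob_index = build_block_index(records, "dob")
matches = fuzzy_surname_exact_dob({"surname": "Smith", "dob": "1980-01-01"}, dob_index)
```

In this sketch every rule in the example pairs the fuzzy criterion with an exact one, so the exact field can always serve as the blocking key. A rule made only of fuzzy criteria would have no such key and would need a different strategy.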

Related

'loosen up' Laravel Scout's search functionality to return more results with database driver

I'm using Laravel Scout with the driver set to database to create a searchable knowledge base.
The issue I've currently got is that I would like each word in the search term to be searched for, returning all results that contain any of the words rather than only an exact match.
So for example:
search term: pay a bill
database records contain: pay my bill, how to pay a bill, paying my bill
With my current setup, it won't return any of these results, as none of them match the exact term searched.
Is there anything I can do in Scout to make it loosely search each word?
I can achieve this by using whereFullText() but I wanted to keep this within Scout and using ::search() if possible.

Elasticsearch - match by all terms but full field must be matched

I'm trying to improve search on my service but got stuck on complex queries.
I need to match documents by terms, but return only those documents that contain all of the provided terms, in any order, and contain only these terms.
So for example, let's take movie titles:
"Jurassic Park"
"Lost World: Jurassic Park"
"Jurassic Park III"
When I type "Park Jurassic" I want only the first document to be returned, because it contains both words and nothing more.
This is a silly example of a complex problem, but I've simplified it.
I tried terms queries, match, etc., but I don't know how to check whether the entire field was matched.
So in short, it must match all tokens in any order.
Field is mapped as text and also as keyword.
Have you tried the terms_set query? The documentation describes it as follows:
Returns documents that contain a minimum number of exact terms in a
provided field.
The terms_set query is the same as the terms query, except you can
define the number of matching terms required to return a document.
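As a hedged sketch of how that suggestion might be applied, the Python helper below builds a request body in which `terms_set` requires every supplied term to be present (via `minimum_should_match_script` with `params.num_terms`). That alone would still match "Jurassic Park III", so the sketch additionally assumes a hypothetical `title.length` sub-field of type `token_count` and filters on it. The field names, the lowercasing (which assumes the standard analyzer on the text field) and the helper itself are assumptions, not part of the original answer.

```python
def exact_token_set_query(user_input: str,
                          field: str = "title",
                          count_field: str = "title.length") -> dict:
    """Build a query matching documents whose field holds exactly these tokens.

    Assumes the words in user_input are distinct and that whitespace
    splitting plus lowercasing mirrors the analyzer used at index time.
    """
    terms = user_input.lower().split()
    return {
        "query": {
            "bool": {
                # Every supplied term must be present in the field
                "must": {
                    "terms_set": {
                        field: {
                            "terms": terms,
                            "minimum_should_match_script": {
                                "source": "params.num_terms"
                            },
                        }
                    }
                },
                # ...and the field must contain no additional tokens
                # (count_field is a hypothetical token_count sub-field)
                "filter": {"term": {count_field: len(terms)}},
            }
        }
    }


body = exact_token_set_query("Park Jurassic")
```

The resulting `body` would be sent as the search request. Titles containing repeated words are not handled by this sketch.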

How to use minimum_should_match to search in multiple fields with multiple search terms?

I'm new to Elasticsearch, so if it's a simple question please forgive me.
Let's say I have 6 keywords and I want to search for them in multiple fields, returning a document only if it contains at least 4 of them. In addition, I may want to single out 1 of them and say that it must be in the document. How should this be implemented?
I'm doing half of this with minimum_should_match, but I want to extend it to cover both multiple fields and a mandatory keyword.
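One possible shape for such a request, sketched in Python as a helper that builds the body: a `bool` query puts the mandatory keyword in `must` and the remaining keywords in `should`, each as a `multi_match` across the chosen fields, with `minimum_should_match` set on the `bool`. The keywords, field names and helper name below are hypothetical, and this is only one way the requirement could be expressed.

```python
def keyword_query(required, optional, fields, minimum_optional):
    """Build a bool query: all `required` keywords plus at least
    `minimum_optional` of the `optional` ones, each searched in `fields`."""

    def clause(keyword):
        # A keyword counts as present if it matches in any of the fields
        return {"multi_match": {"query": keyword, "fields": fields}}

    return {
        "query": {
            "bool": {
                "must": [clause(k) for k in required],
                "should": [clause(k) for k in optional],
                "minimum_should_match": minimum_optional,
            }
        }
    }


# Hypothetical example: 6 keywords, one of them mandatory.
# Requiring 3 of the other 5 gives at least 4 matching keywords overall.
body = keyword_query(
    required=["invoice"],
    optional=["payment", "overdue", "customer", "reminder", "account"],
    fields=["title", "description"],
    minimum_optional=3,
)
```

Because the mandatory keyword always matches, the threshold for the optional clauses is one lower than the overall target.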

Google Search Appliance sort by metadata content

I'm trying to refine the search results received by my application by including the sort parameter in my HTTP requests. I've combed through the documentation here, but I can't find exactly what I'm looking for.
I'm searching for DOC filetypes, and I am able to sort by date or sort by metadata, as in alphabetizing by title, author, etc. I can also filter by whether or not the title contains certain keywords. What I want to do is to sort by whether or not the title contains certain keywords (these documents appearing first in the results), but to still keep the other results.
For example, with keywords [winter, Christmas, holiday] I could do a descending sort by the sum of inmeta:title~winter, inmeta:title~Christmas, inmeta:title~holiday and the top result might be
Winter holidays other than Christmas
followed by documents with one or two of the keywords, followed by documents that meet the other search parameters but contain no keywords.
Is this possible in GSA?
I finally achieved what I was trying to do, so figured I'd post in case it helps anyone else.
As far as I know, it is impossible to create a query with this capability, but with Google's Custom Search API, you can create a search engine with the desired keywords in the context file (by editing the XML file directly or by adding keywords through the CSE console). Then you can formulate the query as usual, but perform the search on your personalized engine.
https://developers.google.com/custom-search/docs/ranking

Sphinx reverse search - when new item is added, execute searches on existing stored keywords

I have an app where people can list stuff to sell/swap/give away, with 200-character descriptions. Let's call them sellers.
Other users can search for things - let's call them buyers.
I have a system set up using Django, MySQL and Sphinx for text search.
Let's say a buyer is looking for "t-shirts". They don't get any results they want. I want the app to give the buyer the option to check a box to say "Tell me if something comes up".
Then when a seller lists a "Quicksilver t-shirt", this would trigger a sort of reverse search on all saved searches to notify those buyers that a new item matching their query has been listed.
Obviously I could trigger Sphinx searches on every saved search every time any new item is listed (in a loop) to look for matches - but this would be insane and intensive. This is the effect I want to achieve in a sane way - how can I do it?
You literally build a reverse index!
Store the 'searches' in the databases, and build an index on it.
So 't-shirts' would be a document in this index.
Then when a new product is submitted, you run a query against this index. Use 'Quorum' syntax, or even match-any, so that a saved search matches even if only one of the query's keywords appears in it.
So in your example, the query would be "Quicksilver t-shirt"/1 which means match Quicksilver OR t-shirt. But the same holds with much longer titles, or even the whole description.
The result of that query would be a list of the original (single-word*) searches that matched. Note this also assumes you have your index set up to treat - as a word character.
*Note it's slightly more complicated if you allow more complex queries: multiple keywords, negations, OR brackets, phrases, etc. In that case the reverse search just gives you POTENTIAL matches, so you need to confirm that each one still matches. That is still a number of queries, but you don't need to run it on all of them.
btw, I think the technical term for these 'reverse' searches is Prospective Search
http://en.wikipedia.org/wiki/Prospective_search
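The idea above can be sketched in plain Python, without Sphinx, to show the two steps: an inverted index over the saved searches yields potential matches for a new listing, and a confirmation pass keeps only the saved searches whose keywords all appear in it. The class and method names are hypothetical, tokenisation is a simple whitespace split (which keeps "t-shirt" as one token), and there is no stemming, so "t-shirts" would not match "t-shirt" here.

```python
from collections import defaultdict


class SavedSearchIndex:
    """Toy reverse index: keyword -> ids of saved searches containing it."""

    def __init__(self):
        self._by_keyword = defaultdict(set)
        self._keywords = {}  # saved-search id -> its full keyword set

    @staticmethod
    def tokenize(text):
        # Whitespace split, so "-" is effectively treated as a word character
        return set(text.lower().split())

    def add(self, search_id, query):
        """Store a buyer's saved search."""
        keywords = self.tokenize(query)
        self._keywords[search_id] = keywords
        for keyword in keywords:
            self._by_keyword[keyword].add(search_id)

    def match(self, item_text):
        """Return ids of saved searches satisfied by a newly listed item."""
        tokens = self.tokenize(item_text)
        # Step 1: any shared keyword makes a saved search a POTENTIAL match
        candidates = set()
        for token in tokens:
            candidates |= self._by_keyword.get(token, set())
        # Step 2: confirm that all of the saved search's keywords are present
        return {sid for sid in candidates if self._keywords[sid] <= tokens}


index = SavedSearchIndex()
index.add(1, "t-shirt")
index.add(2, "quicksilver hoodie")
index.add(3, "Quicksilver t-shirt")
notified = index.match("Quicksilver t-shirt")
```

Here saved search 2 is a potential match (it shares "quicksilver") but is dropped in the confirmation step, while searches 1 and 3 are kept.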
